Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1

2004-03-01 Thread Jungshik Shin
Tomohiro KUBOTA wrote:

From: Markus Kuhn [EMAIL PROTECTED]

to the left, not one *cell*. I know that this is not what backspace does
in some EUC terminal emulators, but I believe a strong case can be made


A correction.  Not *some* EUC terminal emulators, but *every* EUC
terminal emulator.  Do you know of *any* example that is popular
in the CJK world and on which a 0x08 moves two columns over a
double-width character?
 Sure, every one of the Korean terminal emulators (for EUC-KR and Johab) I have
used moves two column widths (a single Korean character) on 'backspace'.
I was rather surprised to learn that Japanese terminal emulators don't.

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: Canonical Mode Input Processing with multi-byte character sets

2004-02-24 Thread Jungshik Shin



On Tue, 24 Feb 2004, Derek Martin wrote:

Hi Derek,

 On Tue, Feb 24, 2004 at 08:43:09PM +0900, Jungshik Shin wrote:
Please, read what I wrote more carefully. I did write that deleting
  the last letter is more useful when you're in the middle of typing a
  sequence of letter to form a syllable.

 I think we're talking past each other here...  I noted that and I agree
 with it.  It's specifically the fact that once I type the third
 character of a hangeul glyph, I can't backspace and change ONLY that
 last character, that annoys me.  You say that most Koreans prefer that
 behavior, and I believe you.  But I can't for the life of me
 understand why...  ;-)  To me, it seems unnatural and inefficient.

 Sorry for my misunderstanding. As you may know by now, the Korean script
has several different facets: it is alphabetic, syllabic and featural
all at the same time. Therefore, different implementations at different
times on different platforms have taken different approaches to
representing and processing the Korean script on computers. Because you
live in Korea now, you must have seen the keypad of Korean mobile phones
and may have learned how to type Korean.  It uses three keys for vowels
and six keys for consonants. Look at how the consonants are grouped and
you may understand why the Korean script is featural.

 Almost invariably once I've committed an erroneous syllable, it's not
 the whole syllable I need to replace, but only the last character
 which I flubbed.  Otherwise, if I made a mistake before the syllable

  Anyway, I understand where you're coming from. Your complaint
is perfectly valid. What you want can and must be implemented. Actually,
Nabi may already have implemented it because its input automaton is based
on the U+1100 Hangul Jamo block. In addition, I have the same complaint about
the most popular Korean mobile phone keypad. It takes a lot more key
strokes to enter a single syllable, and it's annoying to find that 'backspace'
deletes the whole syllable instead of the last letter typed. However,
9th graders on the street don't seem to have a problem at all because
they can type Korean so fast with the keypad that having to re-enter a
syllable from the beginning doesn't appear to matter to them.
So I guess your problem will go away as you get more familiar with your
Korean keyboard and input method.

  However, incremental search needs to be done with individual letters
  as unit instead of syllables. I think Indian people have similar
  needs.

  Incremental search with letters as the unit has been implemented
in only one program (the Korean Emacs, Hanemacs, by KIM Kang-hee) as far
as I know.  It would be great if it were implemented in Mozilla's 'find
as you type'.


  LANG=en_US.UTF-8  (or en_GB.UTF-8, en_CA.UTF-8)
  LC_CTYPE=ko_KR.UTF-8
  LC_MESSAGES=en_US.UTF-8 # not necessary unless LC_ALL is set, but
  LC_TIME=en_US.UTF-8 # just to be sure.
  ---


   # .profile (or whatever)
   LANG=en_US.UTF-8
   LC_COLLATE=C  # I like ASCII sorting for most applications...
   ...
   export LANG LC_COLLATE ...

 Then, when I start up an application where I want to type Korean, I
 originally tried starting it like this:

   $ LANG=ko_KR.UTF-8 LC_COLLATE=ko_KR.UTF-8 LC_MESSAGES=en_US.UTF-8 gedit

 2. Hangeul input via ami simply didn't work.

  There's one missing piece here. Sorry, I forgot to tell you. You have
to set XMODIFIERS to '@im=Ami'. If you log in with the Korean locale
selected in KDM/GDM, this variable is automatically set for you on
most Linux distributions. However, apparently you don't, so you have
to set it manually.
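
For example, something like this should do (just a sketch; it combines
the gedit command you tried with the variables discussed above):

  $ XMODIFIERS='@im=Ami' LC_CTYPE=ko_KR.UTF-8 LC_MESSAGES=en_US.UTF-8 gedit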


 1. Menus were in Korean

  Really? Hmm, you may have 'LINGUA' or something like
that (a non-standard GNU extension) set to Korean. Make sure it's unset.


 As it happens, until recently the most common case I want to do this
 was with mozilla.  It wasn't a major problem then, because my
 installation of Mozilla had no Korean.  But as my Korean improves, I
 have more and more cases where I want to do this.  Of course, I'm also
 better able to navigate the menus, but that's beside the point...  :)

  Actually, Mozilla language packs work independently of the locale. No
matter what your locale is, you can have Mozilla's menus in any
language for which you have installed the language pack.  However,
Ami works with Mozilla only if Mozilla is launched with LC_CTYPE (or
equivalent) set to ko_KR.UTF-8/ko_KR.EUC-KR. BTW, it should be fixed
to work with any UTF-8 locale. Hmm, I'm gonna add it to the TODO list.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Does Hotmail support UTF-8 emails properly?

2004-02-01 Thread Jungshik Shin
Richard Jones wrote :
On Sun, Feb 01, 2004 at 05:35:04AM +0900, Jungshik Shin wrote:

ASCII  are compatible).  For your mail-sending web form, why don't you 
send an email to yourself and view it with mail clients that are well  
I18Nized such as Mozilla-Mail, Mozilla Thunderbird and  MS Outlook Express?


Unfortunately Hotmail is what the majority of the target audience use.
I've now changed the script so that it uses iconv to convert
everything to ISO-2022-JP before sending, and now it works OK in
Hotmail.
 That's unfortunate, indeed. However, it's not that bad if your 
recipients are all Japanese and they don't need to receive non-Japanese 
emails. BTW, I mentioned Mozilla/MS OE as a way to make sure that your 
mail-sending form works correctly because you were not sure that it 
worked correctly.

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: Linux console UTF-8 by default

2004-01-10 Thread Jungshik Shin
Edward H. Trager wrote:

On Saturday 2004.01.10 20:48:31 +0330, Roozbeh Pournader wrote:
 

On Sat, 2004-01-10 at 20:36, Edward H. Trager wrote:
   

Is there any good reason why implementors would not support the
full range of Unicode -- i.e., UTF-8 up to six serialized bytes?
 

UTF-8 up to four bytes, you mean. See
http://www.faqs.org/rfcs/rfc3629.html.
   

I guess I was recalling (from http://www.cl.cam.ac.uk/~mgk25/unicode.html) 
that six bytes allows encoding all possible 
2^31 UCS code points, although
I suppose nothing above plane 1 has been defined.  - Ed Trager
 

Plane 2 has tens of thousands of  Chinese characters and Plane 14 has 
variation selectors and language tags. However, nothing will ever be 
defined above Plane 16. JTC1/SC2/WG2 made a firm commitment to that.

Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: devanagari question

2004-01-02 Thread Jungshik Shin
On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

  If you yearn for the old days

 You seem to have a very slow mind.

  I don't know whose mind is slow. I gave you all the necessary information
and you still couldn't make it work. Here's one more try, with
step-by-step instructions (actually, there's not much to tell you because
you must have taken most of these steps already):

 1. download Sun Indic fonts, which you already did.

 2. Put them (there are two of them) into a directory of your choice
(say, /usr/local/share/fonts), which you must have done already.

 3. Edit /etc/fonts/local.conf or $HOME/.fonts.conf
and add the directory above to the font search path
(see the sketch right after this list).

You can skip this step if you put the fonts into one of the
directories (or their subdirectories) already listed in
/etc/fonts/fonts.conf, /etc/fonts/local.conf or $HOME/.fonts.conf,
such as /usr/share/fonts or /usr/share/fonts/indic.

 3b. although not necessary (because fontconfig
 scans font directories regularly), run the following, if you
 want to make sure.

   fc-cache -v -f directory_name


 4. Launch Mozilla (built with CTL and Xft) and enjoy.  Your web page
was written in such a way that no further configuration is necessary
on Mozilla's side.

 5. _Optionally_, go to font pref. panel of Mozilla and set Devanagari fonts to
Sun's fonts. Also make sure 'allow documents to use other fonts'
is NOT checked. This is necessary for viewing other Hindi pages.
Because most other Hindi sites don't specify 'lang=hi' [1], you have
to launch Mozilla under hi_IN locale (i.e.
'LC_ALL=hi_IN.UTF-8 mozilla') [2]
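
Here is a minimal sketch for step 3, assuming you have no personal
fontconfig configuration yet and that the fonts went into
/usr/local/share/fonts (adjust the path to wherever you put them):

# per-user fontconfig config: add the new font directory, then rebuild the cache
cat > $HOME/.fonts.conf <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <dir>/usr/local/share/fonts</dir>
</fontconfig>
EOF
fc-cache -v -f /usr/local/share/fonts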

For an X11core build (with CTL but NOT with Xft), you have to follow the
steps (which can be simplified slightly with chkfontpath, available on
FC1/RH/Mandrake) described at (or do something equivalent to)
http://bugzilla.mozilla.org/show_bug.cgi?id=176315#c14

(The last two fields of XLFD for Sun Indic fonts should be
'sun.unicode.india-0' instead of  'hykoreanjamo-1'). See also

   http://bugs.xfree86.org/show_bug.cgi?id=939

With the encoding file for Sun Indic fonts, you don't need
to make aliases.

If you want to use 'standard' opentype fonts for Devanagari, you can
try the latest (but still old/outdated) patch
at http://bugzilla.mozilla.org/show_bug.cgi?id=215219

[1] BBC Hindi site will begin to use 'lang=hi' in a couple of weeks.
[2] You don't have to once Mozilla bug 208479 is fixed.
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2004-01-02 Thread Jungshik Shin
On Sat, 3 Jan 2004, Jungshik Shin wrote:

 On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

   If you yearn for the old days
 
  You seem to have a very slow mind.

   I don't know whose mind is slow. I gave all the necessary information
 and you couldn't still make it work. Here's one more try with a

  I'm sorry, I forgot that I have always built Mozilla with a patch
that went into the trunk only a few days ago. That patch was made so
long ago (and it's only necessary for Devanagari, not for Tamil)
that I took it for granted, but it was not in the tree until
a few days ago. The patch to apply (you only need to apply it
if you download the 1.6b release source instead of the CVS trunk source)
is available at http://bugzilla.mozilla.org/show_bug.cgi?id=203406
(the last patch uploaded there).

  BTW, X11core build doesn't need this patch to work although with the
patch, it works better.

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-31 Thread Jungshik Shin

On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

 Good. So no need to worry about the html page.

 Actually, there is. By 'sun_devanagair_font', I didn't mean that
you use that verbatim but that you have to replace that name by the
actual name of Sun's font. Besides, it's always a good practice to put
one of five CSS generic font families (serif, sans-serif, etc) at the
end of your font list  as I wrote.


 Remains to worry about Mozilla and/or the X server and/or fontconfig.

  The X server plays only a small part in the equation as long as it supports
the Render extension. Did you put Sun's Saraswati fonts (two of them)
in one of the directories scanned by fontconfig?

 things work. Am quite prepared to use cryptic names like
 -altsys-saraswati5-medium-r-normal--0-0-0-0-p-0-iso10646-1

  Well, with that XLFD name, Mozilla (an X11core build) wouldn't
recognize it as a SunIndic font, so Devanagari wouldn't get rendered
as it should. You have to alias it so that the last two fields of the XLFD are
sun.unicode.india-0 (or something like that) by editing the fonts.alias file,
plus some other chores involved in X11 font installation.  That's one
of the reasons I told you to use an Xft build.
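
If you do stick with the X11core build, here is a hedged sketch of that
aliasing step (the XLFD has to match what xlsfonts actually reports on
your system, and the directory is wherever the font lives):

cd /usr/local/share/fonts
# add an alias whose last two XLFD fields read 'sun.unicode.india-0'
cat >> fonts.alias <<'EOF'
-altsys-saraswati5-medium-r-normal--0-0-0-0-p-0-sun.unicode.india-0 -altsys-saraswati5-medium-r-normal--0-0-0-0-p-0-iso10646-1
EOF
xset fp rehash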


 but you seem to imply that life is simpler today. Not yet for me.

  If you yearn for the old days of XLFD, X11core fonts and
mkfontdir/mkfontscale/xset fp/chkfontpath/xfs/fonts.dir/fonts.alias/
fonts.scale etc., you can stay there by continuing to use a non-Xft
(X11core) build of Mozilla. However, for an increasing number of programs
in modern Linux distributions you won't have a choice; soon gtk2
will stop honoring GDK_USE_XFT=0.

 [Answering my own question from yesterday night - the new Mozilla build
 shows as possible font choices things in the output of fc-list on the
 client.]

Where have you been during the client-side font revolution? On Mars ;-) ?
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-30 Thread Jungshik Shin

On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

 [Installed Fedora 1 on a spare machine - compiled Mozilla 1.6b
 after ./configure --enable-ctl --enable-xft . It runs fine (*), but
 doesn't show what I expect to see.]

 Let me repeat my question, this time referring to
 http://homepages.cwi.nl/~aeb/moz/test.html

It works fine on my machine with SunIndic truetype fonts installed.
The string there is rendered exactly like the image below.

 [Apart from the obvious Mozilla bugs, there is a change in behaviour.
 The old build showed in Edit/preferences/appearance/fonts actual font
 names, the new build shows font family names. The font names were
 very recognizable: just the output of xlsfonts. These font family
 names have an origin unclear to me. Mozilla does not run on the
 X server, but the X server has the fonts, maybe there is a problem there?]

Not at all.  As I explained at least twice on this list, there are
two flavors of Mozilla builds, the X11core build and the Xft (client-side font)
build. The latter does NOT use the 20-year-old (broken) XLFD-based font
selection scheme any more. Font selection in the Xft build works more
like it does on Windows and Mac OS (and more in line with CSS). You don't
think end users should have to care about seeing all those (cryptic to them)
'iso8859-1', 'iso10646-1', 'jis0208.1980-0' strings and the like, do you?

Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-29 Thread Jungshik Shin

On Sun, 28 Dec 2003 [EMAIL PROTECTED] wrote:

 but I tried compiling on a Debian (Woody) and on a RedHat (7.2) machine.
 In both cases Mozilla-1.6b.

 For Debian the compiled binary does not run. Errors are like reported:
  ./mozilla-bin: relocation error:
  mozilla/dist/bin/components/libgfx_gtk.so: undefined symbol:
  GetContent__C8nsIFrame

  Obviously, I can't possibly know what's wrong with your Debian
build environment (linker, compiler, etc) :-) Why don't you post to
netscape.public.mozilla.unix newsgroup at news.mozilla.org with
details including the output of 'nm'?


 For RedHat the version compiled with --enable-ctl runs, but still
 does not handle devanagari.

 Did you install Sun's fonts? It only works with the Sun fonts I
mentioned, in case that's not clear from my post and the i18n release notes.
Although there's a way to make it work with a non-Xft build (which I'd
rather not explain here), I'd recommend you build with '--enable-xft'.


 [On the other hand, adding --enable-xft fails (on Debian):
  checking for xft... Package xft was not found in the pkg-config search path.


  Your Debian seems pretty much outdated as far as Xft/fontconfig is
concerned.

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-29 Thread Jungshik Shin
On Sun, 28 Dec 2003 [EMAIL PROTECTED] wrote:

 [A week or so ago I wrote a multilingual text, and several
 languages failed under default Mozilla. If we succeed in
 getting a version that handles devanagari then a next point

You have to make sure to tag the Devanagari part with 'lang=hi-IN'
for HTML and with 'xml:lang=hi-IN lang=hi-IN' for XHTML (if it's Hindi).
That is, you have to do something like this for XHTML:

<p lang="hi-IN" xml:lang="hi-IN">
...
</p>

<div lang="hi-IN" xml:lang="hi-IN">
...
</div>

<html lang="hi-IN" xml:lang="hi-IN">
...
</html>

<body lang="hi-IN" xml:lang="hi-IN">
...
</body>

You may also 'style' Devanagari parts with the following style:

font-family: sun_devanagari_font,
 default_devanagari_font_on_Windows,
 default_devanagari_font_on_Mac,
 some_free_devanagari_opentype_fonts,
 generic_css_family

The reason you have to put 'sun_devanagari_font' at the beginning
is that it is not likely to be installed
on most Windows/Mac OS X systems, so it does no harm there,
while for Mozilla on Linux it's essential that it's picked up
_before_ other Devanagari fonts likely to be installed.

Certainly, things should be easier than this, but that's where Mozilla
stands at the moment.


 for discussion will be vocalized Hebrew. For now the first

  It's not likely to work yet because vocalized Hebrew involves
combining marks (right?).

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-29 Thread Jungshik Shin
[EMAIL PROTECTED] wrote:
 Jungshik wrote:
 lots of good advice
 
 Thanks !

You're welcome.

 However, I will not pursue this further. Have no time.
 For the time being it seems this is something where Internet Explorer
 works, and Mozilla still requires a nontrivial amount of work.

  There are certainly a lot of things to do, but that doesn't mean
that it doesn't work.

  On Windows 2k/XP, the _default_ Mozilla build works almost as well
as MS IE for complex scripts (except for rendering justified text
and cursor movement/selection). On Unix/Linux and Win 9x/ME,
you need a CTL-enabled build and the right fonts.


 (Posted to mozilla-build or so. Awaiting moderator approval.

 If you had used the newsserver (news.mozilla.org) instead of
the mailing list, it'd have been just posted without approval.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-25 Thread Jungshik Shin

On Tue, 23 Dec 2003 [EMAIL PROTECTED] wrote:

 Recently I noticed that for me the sequence U+092C U+093F (b i)
 is rendered by Mozilla as b followed by i, while in fact the i glyph
 should precede the b glyph.

 Is something wrong in my expectations? or in Mozilla? or in my
 Mozilla 1.5 setup?

 Devanagari is not supported by the default Mozilla build on Linux
(as noted in the international known-issues page).  On Windows 2k/XP,
Devanagari, Thai, Tamil, Korean and other complex scripts supported by
Uniscribe are supported (although in a somewhat limited way) if you install
any of the complex-script support packages (go to Control Panel | Regional
Options or something like that) and reboot.  On Windows 9x/ME, only Tamil and
Korean are supported, with 'special' fonts. Thai is supported only
on the Thai version of Win 9x/ME.

 If you want to make Mozilla support Devanagari on Linux, you have to
download the trunk source from CVS and build with '--enable-ctl'
and 'gtk' (for gtk2 + ctl, see Mozilla bug 189433). If you like 'Xft'
(as many others do and I strongly recommend), turn on '--enable-xft'
as well. Then install the SunIndic font (the truetype version for 'Xft')
available at http://developer.sun.com/techtopics/global/index.html
(follow the link for the free Indian fonts).

 (Funny setup, to be broken by default, but even the release page
 http://www.mozilla.org/releases/mozilla1.6b/known-issues-int.html
 mentions this. See also
 http://bugzilla.mozilla.org/show_bug.cgi?id=201746 .)

 Nothing funny. Complex script support is not that simple especially
when you have to retrofit it. I'd love to turn it on by default, but the
cursor movement issue has to be resolved before turning it on (see bug
203406 as well). And, eventually, we have to use Pango (see bug 215219).

 that source was so dirty - the produced binary failed with errors like
  ./mozilla-bin: relocation error:
 mozilla/dist/bin/components/libeditor.so:
 undefined symbol: GetViewExternal__C8nsIFrameP14nsIPresContext

 In the mozilla binary directory, you have to run

 $ sh run-mozilla.sh ./mozilla-bin

By directly running 'mozilla-bin', you made it pick up
symbols from some other places (probably, system-wide nspr/xpcom/*
shared libraries installed on your system.)

 BTW, see also http://sila.mozdev.org

 Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: devanagari question

2003-12-25 Thread Jungshik Shin

On Wed, 24 Dec 2003, Jan Willem Stumpel wrote:

 It would be nice if solutions to common problems (in this case
 'how to put an UTF-8 string on to the screen', solved, e.g., by
 Openoffice) were shared between different open-source projects.

 OpenOffice uses ICU's layout engine that supports some complex
scripts but not all complex scripts. In case of AbiWord, I don't know
anything about its internals, but ICU and Pango (http://www.pango.org)
are two obvious choices (both are open-sourced) if its developers want
to support complex scripts (Brahmi-derived scripts - Devanagari, Tamil,
Telugu, Thai, Lao, Khmer, Tibet, etc-, Korean Hangul, Mongolian).
Does it support scripts that require BIDI/RTL (Hebrew, Syriac and Arabic
among others)? Also, note that even Latin, Greek and Cyrillic alphabets
are complex once you go beyond basic stuffs because some languages need
base letter + combining diacritic marks for which there's no precomposed
form in Unicode.

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode fonts on Debian

2003-12-20 Thread Jungshik Shin

On Sat, 20 Dec 2003, Edward H. Trager wrote:

 On Saturday 2003.12.20 15:06:11 +0100, Jan Willem Stumpel wrote:

   Actually, no. I think I already explained this.
 
  Yes, you did (on 15 December). Sorry. I stand corrected. So: the
  default language group is determined by the UTF locale (which

   s/UTF// :-)

  incidentally also determines Mozillas GUI font). On Linux, the
  default language group determines the fonts which Mozilla tries to
  use (by preference) for displaying all Unicode characters. On

  Yes, unless there are other pieces of information that are more
relevant.


  Windows, the preferred font is determined by the code range, which
  seems more sensible, and in your bug report you propose to have
  the same mechanism on Linux also.

 I second that: Regardless of what mechanisms are used, it would be very nice
 if Mozilla worked identically on Linux and on Windows.
 (moved below)
 Also, I assume that it would lead to some slight simplification of
 the Mozilla code base,

  Nobody would ever disagree with you. Do you seriously believe Mozilla
developers would make their own tasks more difficult by not doing what you
wrote? However, the reality is not that simple. Note that on Linux/Unix
alone, we have a few different toolkits/font technologies to support that
are very different in their characteristics (XLFD vs fontconfig). Aside
from Linux, gecko-based browsers run not only on Win 9x/ME and Win2k/XP
(they're different OSes in many respects) but also on several Unixes, OS/2,
Mac OS X, QNX, and VMS (and an unknown number of embedded devices). There
might (or might not) be a way to abstract away all these platform/toolkit
dependencies, but the current level of abstraction in Mozilla is not
there yet.  If we could use 'fontconfig' (+ Pango or ICU) _everywhere_,
it'd be easy to do. However, we wouldn't want to ask Mozilla
users on Windows or Mac OS X to install fontconfig + Pango or ICU,
and including them in Mozilla is obviously out of the question because Mozilla
without them is already too 'fat'.

 That makes it much
 easier for developers who have to test whether web pages look the same on
 different platforms.

  Well, the platform-dependent font availability is another important
factor that makes the platform parity hard to achieve.


  Probably not :-( , because when I try it on Win98 with Mozilla
  1.5, accessing a page with span lang=ru /span ,
  Putin is in the Cyrillic preferred font, while Yeltsin is in the
  Western font. Exactly the same as in Linux.

 There's another factor I didn't mention that affects when/whether
the 'Unicode character to script' mapping kicks in. Mozilla-Win tries to stay with
the currently selected font as much as possible to avoid 'ransom note'
style rendering (which looks horrible in some cases). Therefore, as long
as the current font covers the Cyrillic letters, I believe it won't
switch.  However, I guess 'lang=ru'/'xml:lang=ru' is regarded as a strong enough
indication of authorial intent to warrant the font switch.
(It's been a while since I last looked at that part of the code,
so I'm just writing from memory.)

  BTW, Mozilla doesn't do any 'global optimization' [1] in the
font selection as might be done by some word processors or other rendering
engines/libraries (e.g. Pango or ATSUI on Mac OS X). That is, its text
drawing/measuring routines can take only a small text chunk (sometimes
just a single character) at a time and doesn't know anything beyond that.


  So I _still_ dont understand it (including your bug report).
  Apologies in advance if I have overlooked something obvious..

  You don't have to apologize. It's complicated and the only
way to understand it fully is to read the code and work on it. Although
I worked on Windows and Gtk (Linux/Unix) ports of Mozilla's text
drawing/measuring routines for a while, I don't claim to know every
gory detail. What's certain is that Mozilla developers try to match
what's stipulated in the CSS specification (http://www.w3.org/TR/CSS2)
[2]. Whether they're successful or not is another matter, though.

 Jungshik


[1]
http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/PS/FontComposition.ps.gz
[2] See, for instance, http://bugzilla.mozilla.org/show_bug.cgi?id=227889
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode fonts on Debian

2003-12-19 Thread Jungshik Shin
On Wed, 17 Dec 2003, Jan Willem Stumpel wrote:

 [EMAIL PROTECTED] wrote:

  http://ken2403king.kir.jp/form.htm

 That's a funny one, indeed. When I opened it in Mozilla it was
 displayed as . For a moment I thought it
 was Chinese (which I do not know) but it is gibberish. Mozilla
 thought it was Chinese Simplified GB 18030. The source says <html
 LANG=ja>. It is Japanese with shift-jis encoding; in reality it
 says . (Isn't Unicode fun, allowing one to put
 both variants in a mail message, just by copying from the Mozilla
 screen like this..)

 So, isn't the LANG attribute *more* irrelevant, because it did not
 help Mozilla (1.5a) to display the text correctly?

  It's impossible to infer the document encoding from the 'lang' tag.
With NCRs, any document encoding can be used to represent any Unicode
character. Even if that were not the case, how could you tell whether it's
Shift_JIS, EUC-JP, ISO-2022-JP or EUC-JP (with JIS X 0213) _purely_
based on the value of 'lang' (suppose we don't have UTF-8, UTF-16 and UTF-32,
for the sake of argument)?  The value of 'lang' plays a role ONLY after
the identity of the characters in the document has been determined. See below.

 A META tag
 attribute charset=shift-jis added to (a copy of) the page did.
 Doesn't that mean that encoding is more relevant than language?

 Internally, Mozilla works in terms of Unicode. That is,
it has to determine the document encoding correctly (to convert the
'byte stream' of the document into a Unicode character 'stream')
before doing any font selection.  If it mistakes Shift_JIS for GB18030,
what the character drawing routine receives doesn't make sense, and the
'langGroup' inferred from the document encoding is in conflict with
the language specified in the document (or a part thereof); with NCRs able
to represent any Unicode character, whether or not it is covered
by the current document encoding, this can happen all the time. Which one is
given higher priority? IIRC, it's the latter. So Mozilla tries to render what
it regards as 'a document in GB18030' (which is actually in Shift_JIS)
with Japanese fonts if possible.

BTW, as you know, GB18030 is another UTF, so that even without resorting
to NCRs (&#xhhhh; or &#dddd;) it can cover the full range of Unicode.

  Another BTW: which character encoding Mozilla comes up with for
unlabelled documents depends on your setting under
View | Character Coding | Auto-Detect.  If it's set to Chinese,
it'll come up with one of the Chinese encodings for a Shift_JIS document.
Therefore, properly labelling html/xhtml/css documents is very important. (Try
the document in question with the html/xhtml validator at
http://validator.w3.org and see what it says.)

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode fonts on Debian

2003-12-19 Thread Jungshik Shin
On Fri, 19 Dec 2003, Eric Streit wrote:
 I have a small question ...

 The pages are perfectly rendered on the screen, but when it comes to
 printing, only one encoding is done and all the other glyphs are
 converted to missing characters.

 Why not Mozilla ?

That's partly because Mozilla's printing on Unix still has a lot of room
for improvement and partly because you didn't configure it properly. Well,
the latter is also partly due to the former (it should be easier and more
intuitive to configure). In an earlier posting in this thread, I explained three
different printing 'modules' and gave some references. If you're interested
in printing Latin letters and Cyrillic letters, all three methods should
work, but Xprint and Freetype printing should give you better results
than the default PS module (which is always the case for any script).
How to use Xprint with Mozilla is well documented in
http://xprint.mozdev.org. As for freetype printing, you have to
edit either the global (system-wide) unix.js (found in
places like /usr/lib/mozilla-1.5/defaults/prefs/unix.js. From this,
you may guess where it's actually placed on your system) or
per-profile configuration file prefs.js in
$HOME/.mozilla/profile_name/salted_name/prefs.js (where
salted_name is like 'k9xkxtyu.slt') to add the following:

pref("font.FreeType2.enable", true);
pref("font.FreeType2.printing", true); // on by default in mozilla.org builds
pref("font.freetype2.shared-library", "libfreetype.so.6");
pref("font.directory.truetype.1", "/true/type/dir/1st");
pref("font.directory.truetype.2", "/true/type/dir/2nd");

pref("font.directory.truetype.n", "/true/type/dir/nth");

where '/true/type/dir/1st' through '/true/type/dir/nth' are directories with truetype
fonts.

If you edit the latter (the per-profile user configuration), you have to use
'user_pref' in place of 'pref', and the file should be edited while Mozilla
is NOT running. Alternatively, you can edit these preferences by typing 'about:config'
in the location bar. In the 'filter' box at the top of the page, type
'freetype'; you can then change a value by right-clicking on the pref
entry you want to edit. If you want to add a new
entry, choose 'New | Entry type' in the pop-up menu that comes up.

Hope this helps,

Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode fonts on Debian

2003-12-16 Thread Jungshik Shin
Edward H. Trager wrote:
On Saturday 2003.12.13 15:23:30 +0100, Jan Willem Stumpel wrote:

Does anyone have a step-by-step description of how to install
Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni?
I am still puzzled on when exactly what font is used for display
and for printing in the various Mozilla versions. Each time I
think 'I got it' it turns out that 'I didnt get it'...



I don't know whether the following page will answer your question or not:

http://eyegene.ophthy.med.umich.edu/unicode/#fonts


In Edit|Preferences|Appearance|Fonts, Mozilla provides options for specifying fonts
for various script encodings, so you should be able to fine tune exactly which fonts
get used.  
 I wouldn't use 'fine-tune' and 'exactly'. As I wrote in my previous
messages, Mozilla's font selection algorithm is complex, and Mozilla
contributors (including myself) have put a lot of time and effort into it,
but there are still issues. Besides, Mozilla's font selection menu is NOT
per 'font encoding' BUT per 'langGroup' (which had better be called
'script group').  Only in the Mozilla X11core build is there even a loose
mapping between 'font encodings' (XLFD-based) and 'langGroups'.

There is also a checkbox to Allow documents to use other fonts which I
assume means that if the right glyph isn't found in the specified Unicode font, a 
glyph will get picked from whatever remaining installed font has that glyph. 
 No, that's not what it means. That checkbox controls whether or not
author-specified fonts (via font-family in CSS and font face in old-style
html) should be given a higher priority than the fonts configured in
Mozilla's font selection menu. If it's not checked, author-specified
fonts are ignored.

 I see
this happen when I view Chinese pages with unusual characters in them.
Whether the above option is turned on or not, Mozilla does its best to
render every character.
If it fails, it falls back to transliteration on Windows and Linux (if
an X11core build is used).
In the case of Mozilla-Xft, it shows a 4-digit (BMP) or 6-digit (non-BMP) hex
number inside a rectangle.

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: Unicode fonts on Debian

2003-12-16 Thread Jungshik Shin
On Tue, 16 Dec 2003, Edward H. Trager wrote:

 On Wednesday 2003.12.17 00:24:54 +0900, Jungshik Shin wrote:
  Edward H. Trager wrote:

  In Edit|Preferences|Appearance|Fonts, Mozilla provides options for
  specifying fonts
  for various script encodings, so you should be able to fine tune exactly
  which fonts
  get used.
 
  Mozilla's font selection
  menu is NOT per 'font encoding' BUT per 'langGroup' (which had better be
  called
  'script group').  Only in Mozilla-X11core build,  the loose mapping between
  'font encodings' (XLFD-based) and 'langGroups' exists.
 

 I wish I understood this better!
 What exactly does langGroup or scriptGroup mean in Mozilla?  Can you point me to

 'scriptGroup' is just a term I coined because I believe it's better than
'langGroup': it's not languages but scripts that are relevant
here. 'langGroups' in Mozilla include 'Western', 'Central European',
'Japanese', 'Cyrillic', 'Arabic', 'Hebrew', 'Tamil', 'Devanagari',
and so forth (just what you see in the font selection dialog).

 a URL that explains exactly how Mozilla does these things, and how that might
 be different from, say, the xft/fontconfig way of doing things?

  I tried to explain it in my long email you quoted in your previous
email apparently without reading it. Maybe not very clearly, but my
two emails (before your first email in this thread) answered most of
your questions.


 Clearly, from a user's perspective I was led to believe something
 possibly quite different about these dialogs in Mozilla.

  What did you believe was the case? Then, I'll go from there if
necessary.

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Unicode fonts on Debian

2003-12-14 Thread Jungshik Shin
On Sun, 14 Dec 2003, Jan Willem Stumpel wrote:

 In the Mozilla font preferences you can set font preferences for
 Unicode, as well as for specific languages like Western, Japanese,
 etc. Am I then correct in assuming that the language-specific
 preferences always take priority over the Unicode preferences?
 Even when displaying a Web page which has charset=utf-8
 in the headers?

 Yes, it's confusing. I think we should get rid of the font
preference entry for 'Unicode' because it's just confusing (there
is some use for it at the moment, though).  Font selection in
Mozilla is strongly influenced by the 'langGroup' (which had better be called
'script' or 'script group').  How is it determined? If there's an explicit
specification of the language ('lang' in html and 'xml:lang' in
xml/xhtml) in the document [1], it's honored. If not, it's inferred from
the document encoding. Obviously, this inference doesn't work at all
for UTF-8. Currently, Mozilla uses the 'langGroup' corresponding to the
current locale for UTF-8 documents. That is, if you run Mozilla under a
zh_TW.(UTF-8|Big5|EUC-TW) locale, the langGroup of a UTF-8 document is
regarded as zh-TW. This doesn't work well and totally breaks down when
you have an iso-8859-1 (or any other non-Unicode encoding) document with
a lot of characters outside the repertoire of ISO-8859-1 represented
as NCRs (see http://bugzilla.mozilla.org/show_bug.cgi?id=208479 and
http://bugzilla.mozilla.org/show_bug.cgi?id=91190). To work around
this problem, Mozilla on Windows maps Unicode code blocks to Mozilla's
'langGroups', which achieves what you asked about below.

 In other words is there a mechanism (inside
 Mozilla) that says

 -  hmm... I have to display the character with number 49436 (hex
 C11C) here.
 -  this character is in the range of Korean syllables.
 -  now has a language-specific Korean font been specified? If so
 I'll use it.
 -  If not, I use the Unicode font (Bitstream Cyberbit, or
 whatever).

 As I wrote above, on Windows Mozilla does more or less what you
described. Mozilla-X11core and Mozilla-Xft have different font selection
mechanisms. Mozilla-Xft is strongly dependent on fontconfig, which
usually gives much better results than the font selection mechanism of
Mozilla-X11core, but that also makes it hard to fix bug 208479 mentioned
above.


 In other words, are huge complete Unicode fonts like Bitstream
 Cyberbit or Arialuni (which I promise not to try to use again..)
 only used for filling in the gaps where there are no
 language-specific fonts available? There does not seem to be much
 point in having them, then?

  You can also configure Mozilla to use those pan-Unicode fonts
(or fonts whose coverage is broad enough) for all langGroups you're
interested in.

 Another question: does Mozilla consider 'Latin Extended A'
 characters like  (o with macron) to be 'Western'? Many Western

  As I explained above, Mozilla-Win does, but in Mozilla-X11core and
Mozilla-Xft, which character belongs to which langGroup is not a function
of Unicode code point (as it should be) but a function of the current
document encoding and the value of 'lang/xml:lang'.

 fonts (like Times New Roman) have them and display them fine.
 But for instance Bitstream Vera Serif does not have them, and some
 other font (I dont know which) is substituted. Which rules are
 used for this substitution? Does mozilla look for them in
 *another* Western font, or does it look in the 'Unicode' font?

  Mozilla's font selection mechanism is so complex that I can't
explain it in a few words (and it's also platform/toolkit dependent).
In Mozilla-Xft, the fonts for the 'Unicode' langGroup are mostly immaterial,
IIRC (I'd have to look at the code). Mozilla-Xft searches for a font
to render a character in the prioritized list of fonts returned
by fontconfig.  Therefore, what fontconfig returns in response to
Mozilla's query (which usually specifies 'lang' and 'font family name'
but NOT the characters to render) determines which font is used to render
which character. Mozilla-X11core is a different story.  Using the 20-year-old
XLFD makes it very hard to do things right (if you take a look at
nsFontMetricsGTK.cpp at http://lxr.mozilla.org, you'll see what I mean),
and I guess the fonts specified for the 'Unicode' langGroup are referred to at a
certain stage.


  Mozilla's international release notes is your friend although
  we didn't give gory details in the document. In Mozilla, goto
...
 Thanks very much for pointing this out. I had found out about the

  You're welcome :-)


 As regards to printing:
 I have (and have had for years) just 'lprng' and 'magicfilter' to
 print on my old Laserjet IIP. Also xprint works with that (as far
 as it works). Is there any point for me (or in general for users
 wanting a 100 % Unicode system) in switching to CUPS?

  I guess magicfilter should be fine especially considering that
you have a non-PS printer. CUPS is handy when you have a PS printer
that's not quite up-to-date. Mozilla's FT2 printing 

Re: Unicode fonts on Debian

2003-12-13 Thread Jungshik Shin
On Sat, 13 Dec 2003, Jan Willem Stumpel wrote:

 Does anyone have a step-by-step description of how to install
 Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni?

Well, you're not supposed to install MS Arial Unicode on Linux, at
least in some countries.  If you want to install a pan-Unicode font,
you'd better install James Kass' Code2000 (BMP) and Code2001 (non-BMP).
They're available at http://home.att.net/~jameskass.  It'd be nice of you
to pay him $5: he's done a great service by making his fonts available
and deserves some monetary compensation, IMHO. Note that
for good-quality rendering, you'd better get fonts specifically
made for a subset of the Unicode repertoire instead of pan-Unicode fonts.
Google 'alan wood unicode fonts' and you'll find Alan Wood's Unicode font
site. For Latin, you definitely want to install the Bitstream Vera series
(donated by Bitstream). If you're also interested in Greek and Cyrillic,
the fonts made available by SIL (Gentium) are good to have.

 I am still puzzled on when exactly what font is used for display
 and for printing in the various Mozilla versions. Each time I
 think 'I got it' it turns out that 'I didn't get it'...

  Mozilla's international release notes is your friend although
we didn't give gory details in the document. In Mozilla, goto 'Help'
and 'Release Notes'. In the release notes web page, follow the link to
'international known issues'.  Basically, there are two different versions
of Mozilla for Linux and three different ways for printing.

  1. X11core font build (with the gtk or gtk2 widget set):
 This is what's available by default
 at www.mozilla.org. It renders text using server-side
 X11core fonts, which can be bitmap (bdf), Speedo,
 Type1, truetype, CID-keyed fonts, etc. However, all of them
 are 'presented' to clients (in this case, Mozilla) as
 a set of glyphs with a certain character-to-glyph mapping
 and metrics, expressed in XLFD.

  1'. The X11core font build can also take advantage of truetype
  fonts available on the client side if freetype is
  enabled (font.FreeType2.enable has to be set to 'true'
  in prefs.js). By default, it's enabled. You have to add
  directories with truetype fonts by editing prefs.js
  in your profile directory (usually,
  ~/.mozilla/${PROFILE_NAME}/${SALTED_NAME}/prefs.js).
  The preference entries for truetype fonts are
  font.directory.truetype.1, font.directory.truetype.2, and
  so forth (Mozilla looks only at the directories explicitly
  specified and does not look inside their subdirectories).
  Alternatively, you can add them in 'about:config' (type
  'about:config' in the location bar). In addition, you
  have to specify the location of your freetype2 shared
  library.

  2. Xft-based build (with the gtk or gtk2 widget set). This build
 takes advantage of the new client-side font libraries,
 Xft and fontconfig, which in turn rely on the freetype2 library.
 The RedHat rpms available at ftp.mozilla.org are Xft + gtk2
 builds. I guess you can install one of them on Debian
 with alien or similar tools. Usually, this build gives
 faster and better rendering results, especially if you're
 interested in viewing non-Western-European web pages.

Now for printing.

  1. Postscript printing module: this is the oldest. Some people
 regard it as totally broken and have demanded that it be
 removed. Western European users may not have much trouble,
 but if you go beyond that, it begins to show its limitations.
 Even for Western European text, its PS output is far from
 'WYSIWYG'. That is, the fonts used for on-screen rendering have
 nothing to do with the fonts used in the print-out. It can be used
 with both builds listed above.

  2. PS + freetype2: You have to enable both freetype (mentioned
 above) and freetype printing. This can be used with both kinds of
 builds. However, old rpms (Xft+gtk2 builds) used to come with freetype
 disabled, though recent Xft+gtk2 builds at mozilla.org seem to have been
 built with freetype enabled.  This gives reasonable (though not very
 faithful) WYSIWYG. It's not faithful because the font selection mechanism is
 different for printing and for screen rendering. Combined with
 CUPS and other modern Linux print servers, this works rather
 well.

  3. Xprint (http://xprint.mozdev.org). With this, Mozilla
 is an Xprint (X11) client talking to an Xprint server, so you need
 to have an Xprint server running for Mozilla to talk to.
 The font selection mechanism is XLFD-based. Xprint (the client side)
 is enabled in the X11core build at mozilla.org, but is disabled
 in the Xft+gtk2 build.  The Xprint server is available at
 http://xprint.mozdev.org

 More can be found at the aforementioned international known issues
page and links therein.

  Hope this helps,

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



file system conversion tool

2003-12-05 Thread Jungshik Shin

Hi,

I thought some of you might be interested in 'convmv', a file system
encoding conversion utility I just came across. Most of you on this list
are likely to have switched over to UTF-8 already and written a script or two for
the job.  Nonetheless, it may be handy to have a tool like this nearby
so that you can help other 'skeptics' around you 'convert' to UTF-8.
http://osx.freshmeat.net/releases/144059/

convmv converts filenames (not file content), directories, and even
whole filesystems to a different encoding. This comes in very handy if,
for example, one switches from an 8-bit locale to an UTF-8 locale. It
has some smart features: it automagically recognises if a file is
already UTF-8 encoded (thus partly converted filesystems can be fully
moved to UTF-8) and it also takes care of symlinks. Additionally, it is
able to convert from normalization form C (UTF-8 NFC) to NFD and
vice-versa. This is important for interoperability with Mac OS X, for
example, which uses NFD, while Linux and most other Unixes use NFC.
Though it's primarily written to convert from/to UTF-8, it can also be used
with almost any other charset encoding. Note that this is a command line
tool which requires at least Perl version 5.8.0.
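
A typical invocation might look like this (a sketch; convmv does a dry
run by default, so check its output before adding --notest, and replace
iso-8859-1 with whatever encoding your old locale actually used):

# preview the renames first, then do them for real
convmv -f iso-8859-1 -t utf-8 -r /path/to/files
convmv -f iso-8859-1 -t utf-8 -r --notest /path/to/files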


Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: FYI: Some links about UTF-16

2003-07-13 Thread Jungshik Shin
On Fri, 11 Jul 2003, Wu Yongwei wrote:

 S***, it seems I made a mistake.  The font selection in Windows 2000 is not
 at all as flexible as Java; it's more like Linux.  Just that the default
 font in the Simplified Chinese version is still Tahoma instead of Song Ti.

 Thanks for checking that out. You saved me some tinkering :-)


 Jungshik must be right that I could change the default font in locale zh_CN
 to make ASCII characters appear nicer.

  With Gtk2 and fontconfig, I don't have to tinker with the font
configuration as much as before because it looks all right to me.
As for CSS-style font list specification, the infrastructure is already
in place (fontconfig), but the 'UI' part has some catching up to do.
For instance, most GUI programs and window managers don't have a UI that
lets multiple fonts (an ordered list of them) be specified (although it's
possible to do so by editing configuration files manually in _some_
cases.)


  The only problem is that the
 standard locale for Simplified Chinese in Red Hat 8.0 (which I use) is
 zh_CN.GB18030.  I was told that it was possible to change that to
 zh_CN.UTF-8, but I did not find the motive/time to do that.

  It's rather easy. See
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829.
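
In case it saves anyone a trip to bugzilla, the gist (a sketch for Red
Hat 8.0; the exact file contents below are assumed, so back up the
original first) is to change the system-wide default locale in
/etc/sysconfig/i18n:

# /etc/sysconfig/i18n -- system-wide default locale
LANG="zh_CN.UTF-8"
SUPPORTED="zh_CN.UTF-8:zh_CN:zh"
SYSFONT="latarcyrheb-sun16"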


 Regarding the 'A' APIs in Windows.  Do you mean that there should be some
 API to change the interpretation of strings in 'A' APIs (esp. regarding file
 names, etc.)?  If that were the case, the OS must speak Unicode in some form
 internally.

  Yes, that's what I meant.  Beni already gave some details.

Beni  win2k does have the option of
Beni switching the encoding used in the 'A' APIs, it's just global and
Beni  requires a reboot.

 Yup, I frequently do that to test Mozilla under different locales.
Having to reboot is really painful. On POSIX systems, we can just
run a program under any supported locale at the command line. Under Win2k/XP,
'chcp' works inside a 'command prompt' (even setlocale() works), but I
haven't checked whether there's a 'SetACP' (the opposite of 'GetACP').


 remount the partition in an appropriate encoding; if it is on an EXT2/3

  As you found out, there's a tool or you can easily make one as many other
have done.  Once you switch to UTF-8 locale, there's no need to look back.

 partition or on a CD-ROM, then I am out of luck.  Maybe the mount tool
 should do something to handle this? :-)

   In the case of a CD-ROM, it's not much of an issue. See the mount(8) man
page and the other man pages referred to there.
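
For instance, something like this should do for a CD with non-ASCII
filenames (a sketch; the device and mount point are assumed, and the
exact option names are documented in mount(8)):

# mount a Joliet CD so that filenames come out as UTF-8
mount -t iso9660 -o iocharset=utf8 /dev/cdrom /mnt/cdrom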

   Jungshik


P.S. A word of caution: a lot of _text-mode_ programs still assume that a single octet
takes a single screen 'cell', which holds for most legacy single-byte and double-byte
encodings. This assumption breaks down for UTF-8, for the three-byte sequences of
EUC-JP and the four-byte sequences of GB18030 (and the eight-byte sequences of EUC-KR).
Some of them have been modified to cope with two-byte UTF-8 sequences (U+0080 - U+07FF),
but don't work with U+0800 and beyond. Needless to say, combining characters
are not handled in those programs.


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: FYI: Some links about UTF-16

2003-07-10 Thread Jungshik Shin

On Thu, 10 Jul 2003, Wu Yongwei wrote:

 Jungshik Shin wrote:

  I think it's not so much due to defects in programs as due to the lack of
  high-quality fonts. These days, most Linux distributions come with free
  truetype fonts for zh, ja, ko, th and other Asian scripts. However,
  the number and the quality of fonts for Linux desktop are still
  inferior to those for Windows.

 The problem is mainly not font itself, but font combination.  I really
 cannot bear the display of ASCII characters in Song Ti, which is simply ugly
 (and fixed width).

  Why don't you specify a variable-width font as the system default?
I understand you still don't like Latin glyphs in Chinese fonts. I hate
Latin glyphs in Korean fonts, too.


 locale Linux seems to be able to do so, but in the Chinese locale all is in
 the Chinese font, which is not suitable at all for Latin characters.

  I don't think there's any difference between the English and Chinese locales,
provided you mean en_*.UTF-8 and zh_*.UTF-8. You may get the impression
that it works under en_US.UTF-8 because the 'system default font'
for en_US.UTF-8 does not cover Chinese characters and the automatic font
selection mechanism picks up a Chinese font for Chinese characters while
using the default font for Latin letters. On the other hand, in zh_*.UTF-8,
the system default font covers Latin letters as well as Chinese characters,
so both Latin and Chinese are rendered with the default font.

  A way to work around is to specify your favorite Latin font ahead
of your Chinese font if CSS-style font list can be used.

 Beginning with Windows 2000, Windows could choose the
 font to use based on the Unicode range (Java does this too).  In the English

  This is a good feature to have, although a CSS-style font list works
most of the time.  Almost everything we need for this is already in
place (fontconfig, pango). BTW, I haven't seen this available in
Win2k. How can I do that? (Not that I don't believe you; I'm
just curious.)



 I used an Windows Gtk application, which used Tahoma (an good sans serif
 font) at first.  But after an upgrade it automatically chose to use the
 system default font, which is the Chinese Song Ti.  It took me several hours
 to correct the ugly and corrupt (yes, because dialogue dimensions are
 different) display.

  Again, I haven't run Gtk programs under Win32 so that I don't know how
they select fonts. Do they use fontconfig? fontconfig can make a big
difference.


  There seems little sense now arguing the virtues of UTF-8 and UTF-16.
  Technically they both have advantages and disadvantages.  I suppose we

If MS had decided to use UTF-8 (instead of coming up with a whole new
  set of APIs for UTF-16) with  'A' APIs, Mozilla developers' headache(and

  UTF-8/'A' APIs vs UTF-16/'W' APIs and there are many other things to
  consider in case of Win32.


 It seems impossible because there are some many legacy applications.  On the
 Simplified Chinese versions of Windows, 'A' always implies GB2312/GBK.
 Switching ALL to UTF-8 seems too radical an idea about 1994.  At the time

 Using the 'A' APIs with UTF-8 does not mean that the 'A' APIs would work ONLY
with UTF-8.  As you know well, the 'A' APIs are basically the APIs that deal with
'char *'. As such, in theory, they can be used for any single-byte or multibyte encoding,
including Windows code pages 932, 936, 949, 950 and the one for UTF-8 (whose
codepage number I forget).

 As Unix(e.g. Solaris and AIX and to a lesser degree Linux) demonstrated,
a single application (written to support multibyte encodings) can work
well both under legacy-encoding-based locales and under UTF-8 locales.


 Microsoft adopted Unicode, people might truly believe UCS-2 is enough for
 most application, and Microsoft had not the file name compatibility burden
 in Unix

  Well, this is an orthogonal issue. The POSIX
file system is so 'simple' (which is a virtue in some respects) that it doesn't
have an inherent notion of codeset/encoding/charset. However, Windows
doesn't use a POSIX file system, and using the 'A' APIs does NOT mean that they
couldn't use VFAT or NTFS, where filenames are in a form of Unicode.


 (I suppose you all know that the long file names in Windows are in
 UTF-16).

  Actually, VFAT documentation is so hard to come by that we can just
speculate that it's UTF-16 (it could well be just UCS-2 in Windows 95)


 I would not blame Microsoft for this.

  I wouldn't either, and I didn't mean to. I believe they weighed
all the pros and cons of the different options and decided to go with their
two-tiered API approach. In my previous message, I just pointed out a downside of
that approach, aggregating all the other arguments into a single phrase:
'there are many other things to consider.'

 Also consider the following
 fact:  Windows 95 emerged at a time when many people had only 8MB of RAM.
 Yah, I don't think AT THAT TIME we could tolerate a 50% growth in memory
 occupation.

 Windows 95/98/ME are not Unicode-enabled in many senses while

Re: FYI: Some links about UTF-16

2003-07-08 Thread Jungshik Shin
On Tue, 8 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:

 On Tue, 8 July 2003 at 05:22, Wu Yongwei wrote:

  Is it true that Almost all modern software that supports Unicode,
  especially software that supports it well, does so using 16-bit Unicode
  internally: Windows and all Microsoft applications (Office etc.), Java,
  MacOS X and its applications, ECMAScript/JavaScript/JScript, Python,
  Rosette, ICU, C#, XML DOM, KDE/Qt, Opera, Mozilla/NetScape,
  OpenOffice/StarOffice, ... ?

 Do they support characters above U+FFFF as fully as the others? For Python I know

   Yes. At least, I know for sure that Mozilla, MS IE and MS Office XP
do.  That does not make me a fan of UTF-16.  You shouldn't assume
that others don't do what you're not happy to deal with.

The reason they use UTF-16 is NOT because it's inherently better
than the other UTFs (UTF-8, UTF-32) BUT because they (not all of them) began
with UCS-2 and have a lot of baggage (written for UCS-2) to carry
along.  The prime example of this is the Win32 'W' APIs. The same is true of
Java, ECMAScript (the transition is not yet complete in the case of
ECMAScript), and Mozilla.  (See
http://bugzilla.mozilla.org/show_bug.cgi?id=183156, for instance.)


As for applications written with UTF-8 as the internal string
representation (asked about in another posting), there are lots of
them. Basically, most gnome/gtk applications qualify because glib and
pango are based on UTF-8. Moreover, there's a programming language
whose internal character representation is UTF-8, as is well known:
Perl. Besides, judging from the fact that Sun's iconv(3) implementation
uses UTF-8 as a hub (instead of UTF-32, as is the case with glibc's
iconv(3)), many programs on Solaris must be heavy users of UTF-8.


  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: FYI: Some links about UTF-16

2003-07-08 Thread Jungshik Shin


On Tue, 8 Jul 2003, srintuar26 wrote:

  Is it true that Almost all modern software that supports Unicode,
  especially software that supports it well, does so using 16-bit Unicode
  internally: Windows and all Microsoft applications (Office etc.), Java,

 These decisions seem designed mostly to ease compatibility with
 Microsoft's OS.

  I agree. Or, for the lack of foresight...

 The Asian-language argument for UTF-16 seems
 mostly vacuous, and even if it were true it would be the lone

   Here again I agree. The worst case (text made entirely
of characters between U+0800 and U+FFFF) is 3:2.  With characters
below U+0800 (especially the US-ASCII range) mixed in, the ratio is
even lower. For CJK Ext. B and C, UTF-8, UTF-16 and UTF-32 all come out
even.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: FYI: Some links about UTF-16

2003-07-08 Thread Jungshik Shin
On Wed, 9 Jul 2003, Wu Yongwei wrote:

 (excluding the desktop, which I prefer KDE).  But I did have some bad
 experience with Windows Gtk applications running on Chinese versions of
 Windows.  Not for functionality, but for UI.  You are right that they do
 care about Asian languages, but the problem seems that they may not have the
 hands to test on Asian language platforms.  At least not on Simplified
 Chinese Windows.  Not their fault, I must add.  Ah, I cannot bear setting

   I have no experience with Windows Gtk, but it could well be due
to the fact that Win32 APIs come in two flavors, 'A'(NSI) APIs and 'W'
APIs.  MS recommended a few different paths to support both pre-Unicode
(ANSI-based) Windows (Win 9x/ME) and Unicode-based Windows
(Win2k/XP). One of them is to use MSLU (Microsoft Layer for Unicode)
with pure 'W' APIs (not using 'A' APIs at all). Mozilla developers
once considered this approach, but gave it up because it led to a
dilemma. To make Mozilla run under Win 9x/ME, Mozilla developers would have to
tell Mozilla users to install MS IE 5.x or later (or MS Office or other programs
that are licensed to bundle the MSLU dll with themselves).  Obviously,
it doesn't make much sense to ask users to install a competitor before
using it (needless to say, the reality is that virtually all MS Windows users
have MS IE installed, so we wouldn't have to worry...). There may be
other reasons the MSLU path was not taken that I don't know of.

What Mozilla ended up doing is writing its own wrappers and function
pointers for two dozen or so Win32 APIs; each wrapper gets pointed at
either the 'A' API or the 'W' API according to run-time detection of the
OS (Win9x/ME vs Win2k/XP). Mozilla's transition to this is not yet
complete  (see http://bugzilla.mozilla.org/show_bug.cgi?id=162361 and
http://www.mozilla.org/releases/mozilla1.4/known-issues-int.html).
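A rough sketch (mine, not Mozilla's actual code) of that wrapper/function-pointer
technique, keeping strings in UTF-16 internally; the names UseWideAPIs and
MySetWindowText are made up:

    #include <windows.h>

    /* On Win9x/ME most W entry points are stubs, so fall back to the A APIs.
     * GetVersion() sets the high bit for the Win9x/Win32s family. */
    static BOOL UseWideAPIs(void)
    {
        return (GetVersion() & 0x80000000) == 0;   /* high bit clear: NT family */
    }

    static BOOL MySetWindowText(HWND hwnd, const WCHAR *text)
    {
        char buf[512];
        int n;

        if (UseWideAPIs())
            return SetWindowTextW(hwnd, text);     /* native Unicode path */

        /* Legacy path: convert UTF-16 to the active ANSI codepage first. */
        n = WideCharToMultiByte(CP_ACP, 0, text, -1, buf, sizeof buf, NULL, NULL);
        return n > 0 ? SetWindowTextA(hwnd, buf) : FALSE;
    }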

  It's likely that Win32 Gtk is still dependent on the 'A'(NSI) APIs. However,
this is pure speculation and could well be completely wrong.


 Linux locale to Chinese, which makes the desktop too ugly to me.  Rationale:
 The good intent of Open Source developers may not result in understanding
 the requirements of Asian users owing to lack of native
 developers/testers/users.

  That's a bit strange. My desktop under ko_KR.UTF-8 locale is not so bad.
  Anyway, it's not yet as pretty as that of Win32.

I think it's not so much due to defects in programs as due to the lack of
high-quality fonts. These days, most Linux distributions come with free
truetype fonts for zh, ja, ko, th and other Asian scripts. However,
the number and the quality of fonts for Linux desktop are still
inferior to those for Windows.


 There seems little sense now arguing the virtues of UTF-8 and UTF-16.
 Technically they both have advantages and disadvantages.  I suppose we have
 presented enough of them in this discussion.

  Let me just add my last comment...

  If MS had decided to use UTF-8 (instead of coming up with a whole new set of
APIs for UTF-16) with the 'A' APIs, the Mozilla developers' headache (and that of
other open-source developers) mentioned above would have been a lot easier
to cure :-) Of course, this is just one aspect of UTF-8/'A' APIs vs
UTF-16/'W' APIs, and there are many other things to consider in the case of Win32.



  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Strings in a programming language

2003-07-06 Thread Jungshik Shin
On Mon, 7 Jul 2003, Wu Yongwei wrote:

   I wonder, how many people really want to use Unicode codepoints
 beyond
   U+?
 
  I don't want to make it incorrect by design just because cases it
 doesn't
  handle are rare.

 It's unnecessary to handle ALL cases.  You could address only issues
 encountered/expected by your end users.  IMHO, it is more important to
 make an application be light-weight and run in 99% cases.  Or, you may
 find your language used by, say, 1 people, and none uses the extra
 features that you spend 40% of your development labour.  And it is

  As you wrote, one can do what one believes. Anyway, correctly
handling non-BMP characters is not that difficult (40% of your
devel. time for a 1% constituency seems to me too big an exaggeration
:-) I know you're just making your case clear...).  Moreover, with
Math characters in plane 1 and MathML more widely used, it'd not be
so rare to find people who want to use non-BMP characters.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: diacritic marks for Latin alphabet (Re: supporting XIM)

2003-04-02 Thread Jungshik Shin
Edward Cherlin wrote:

On Monday 31 March 2003 10:05 pm, Jungshik Shin wrote:
  

Let's try some more.
aeiounx
  


I'm pleased that the accents are still there after four levels of 
replies.


  That's because all three of us (Gaspar, you and I) do what we preach,
namely, using UTF-8 in our everyday computing :-)

Not too bad, except that only the first three accents on
each letter are actually displayed, and the dot on the i
isn't removed.
  

  Hmm, I can see only two diacritics in Kwrite with Code2000



Yes, I get only two visible diacritics with Code2000.

   I think Code2000 has some (maybe not so
comprehensive) OT layout tables for Latin letters. I'm copying
this to its author, James Kass. 
  

font. I found that you appended as many as five of them to
each character in your sample.  What font did you use?
Nonetheless, it's a pleasant surprise that Kwrite does more
than simple overstriking.



kwrite 4.0
kde 3.0.3
Arial Unicode MS (licensed copy) shows 3 diacritics
  

Can you check your font with VOLT (www.microsoft.com/typography)
as to whether it has OT layout tables for Latin letters?  You need
to apply to join the OT developer group to get a copy.
It seems to be the only tool available for editing OT layout
tables.  I hope pfaedit will offer the feature soon.
 

kmail 1.4.3
Courier [Adobe]
3 diacritics displayed
  


Courier?  Hmm.  How about 'Courier' in kwrite?
So, are multiple diacritics stacked over each other taking *disjoint*
spaces instead of overlapping one another at the same spot?

  Anyway, now I'm wondering what Qt/KDE use for rendering.
Do they use Pango (it couldn't be, because Pango
doesn't support OT layout tables for Latin yet, although
simple overstriking is supported) or do they have their own complex script
rendering library?

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Mozilla Rendering (was Re: gtk2 + japanese; gnome2 and keyboard layouts)

2003-04-01 Thread Jungshik Shin
Edward Cherlin wrote:

On Tuesday 01 April 2003 08:02 am, Edward H Trager wrote:
 

Can Jungshik or someone else please clarify for me what
Mozilla 1.3 currently uses for complex script rendering? I'm
seeing differences in rendering of Thai on Linux (horrible)
vs. in Windows (OK) in Mozilla 1.3. 
   

Uniscribe on Windows. It supports Thai. 

  Well, I guess even on Windows, Mozilla does not make use of Uniscribe
(at least not explicitly, as far as I know) or of intelligent
fonts with opentype layout tables.  Actually, I'm not sure. I asked
about this a couple of times, but got no answer.

I don't know what it uses on Linux, but it uses something that 
doesn't support Thai properly, 

 It sorta does if you compile it with the CTL (complex text layout)
feature turned on.
Mozilla source code includes a 'miniature version' of Pango for rendering
a couple of Indic scripts and Thai (contributed by Sun). However, that's
only for the 'plain gtk' build of Mozilla (not using Xft but old X11 core
fonts). A similar 'hack' (but not depending on Pango) should be possible
for the Xft build of Mozilla when bug
176290 is resolved  (http://bugzilla.mozilla.org/show_bug.cgi?id=176290)

This is the point about building text rendering into the system. 
Applications cannot have their own rendering engines in general. 
So whatever the system renderer supports is the best you can 
expect in most software (if that).
 

  I fully agree with you. The problem with the current Mozilla is that
it seems rather hard to write a bridge to Pango (although I have a couple
of 'vague' ideas as to how to do it, and I'm sure genuine gurus of Mozilla
have their own better ideas as well.)
Besides, I believe the Mozilla-Graphite 'marriage' should serve as a good
model for a Mozilla-Pango couple.

Jungshik

P.S. BTW, Thai can get rendered 'automagically' (well, not as well as
Thai people expect) if you have fonts for simple overstriking with
zero/negative advance widths for combining characters.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-04-01 Thread Jungshik Shin
Edward Cherlin wrote:

On Monday 31 March 2003 10:40 pm, Jungshik Shin wrote:
 

Edward Cherlin wrote:
   

Have you looked at SILA? It uses SIL Graphite as the renderer
for Mozilla.
http://sila.mozdev.org/
 

Yup. I'm aware of it.  At least for now it's only for Windows,
though. However, we may get some valuable insights from the
project that can be applicable to a 'Mozilla-Pango' marriage.
   

I mean the part of the project that says they want to do a Linux 
port of Graphite, and thus of SILA, but not much is going on 
with it.
 

 A couple of issues: I guess OpenGraphite for Linux is not yet ready
for prime time while Pango is mature. SILA currently uses MS COM instead of
xpcom. To make SILA work on Linux, MS COM needs to be replaced by xpcom. We'll
see which one gets there first, OpenGraphite or Pango. 

Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: diacritic marks for Latin alphabet (Re: supporting XIM)

2003-04-01 Thread Jungshik Shin
Pablo Saratxaga wrote:

The only latin-script based languages I know that use some accentuated
letters not existing in precomposed form in unicode are Guarani
(it uses g with tilde) and Chechen (it uses several letters with
a dot above, some exist in precomposed, but others don't).
There may be others, but I only know about those two.
 

  I think orthographies of some African languages also need Latin
letters with diacritics for which Unicode/ISO 10646 has never assigned, and
will never assign, precomposed forms.
And, if we consider Old and Middle European languages, there are even more.
Needless to say, IPA (although not a language) is a very 'fertile' source
of a number of accented letters.
(I believe there are some IPA letters linguists want to use that are
not given separate codepoints.)

I didn't and wouldn't count math symbols here, although there are a
lot of them with a Latin letter as the base character.

Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: opentype

2003-04-01 Thread Jungshik Shin
srintuar26 wrote:

(For the sake of argument, if all precomposed glyphs were abolished,
leaving NFC==NFD, then how would we store composition specializations
inside fonts...)
 

 You have to distinguish between characters and glyphs here. The number
of Unicode characters representable with a font is different from the number of
glyphs in the font, because, as you wrote, diacritic marks for Latin/Greek/Cyrillic
and other combining characters take different shapes and positions depending
on where they're used. The same is true of base characters: the shape of
a base character differs depending on whether it's used alone or combined with
combining characters, and on how many and which combining characters it
combines with.

In modern intelligent fonts like opentype fonts, the character-to-glyph mapping
is not 1-to-1 but m-to-n, where m, n >= 1.
The way this m-to-n mapping is
stored in fonts and accessed by rendering/layout engines varies.
(There's even a proposal to add this intelligence to the old X11 BDF format.)
Opentype fonts have layout tables like GSUB and GPOS that have to be
accessed and activated by rendering engines like Uniscribe and Pango.
The amount of intelligence embedded in opentype fonts is smaller than
that in AAT (Apple's intelligent font format), in the sense that
Uniscribe and Pango have to do more work for opentype fonts than would be
necessary with AAT fonts.
Graphite is another font-format-plus-rendering-library pair (it uses the
opentype container format, but its layout tables are different from the
GSUB/GPOS and so forth used by Pango/Uniscribe).
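A purely conceptual C sketch (mine, not the actual OpenType GSUB/GPOS binary
format) of what an m-to-n character-to-glyph entry amounts to; the glyph IDs
and the structure name SubstEntry are made up:

    #include <stddef.h>
    #include <stdint.h>

    /* One substitution entry: a run of input characters maps to a possibly
     * different number of output glyphs. */
    typedef struct {
        uint32_t in_chars[4];    /* up to m input characters (Unicode)  */
        size_t   in_len;
        uint16_t out_glyphs[4];  /* up to n output glyph indices        */
        size_t   out_len;
    } SubstEntry;

    static const SubstEntry example_subst[] = {
        { { 'f', 'i' },       2, { 912 },      1 },  /* 2 chars -> 1 ligature glyph     */
        { { 0x0065, 0x0301 }, 2, { 72, 1031 }, 2 },  /* e + combining acute -> 2 glyphs */
        { { 0x00E9 },         1, { 73 },       1 },  /* precomposed e-acute -> 1 glyph  */
    };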

For details, see http://www.microsoft.com/typography
   http://developers.apple.com/fonts
http://www.pango.org
   http://graphite.sil.org
   and Adobe's page
Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


a patch for vim to add a 'cjkw' option for CJK users with CJK monospace fonts

2003-04-01 Thread Jungshik Shin
Hi,

Attached is my patch to add a 'cjkw(idth)' option to toggle CJK width
handling. When it is turned on, characters with an East Asian Width class
of 'A'(mbiguous) (see UTR #11 'East Asian Width') are treated as
having a cell width of 2 instead of 1. The default is off (because
the affected characters had better be treated as having a cell width
of 1 'typography-wise') and it's only effective when the fileencoding
is UTF-8.
This option is necessary because in GUI mode (and in a terminal
where a CJK font is used or a similar option is turned on,
e.g. xterm with the 'cjk-width' option), many East Asian
(CJK) people use CJK fonts which have fullwidth (cell width of 2)
glyphs for characters with EA Width class 'A'. With
this patch and 'cjkw' turned on, there's no more inconsistency
between the width of the glyphs for characters like the Euro, registered
sign and copyright sign in those fonts and the width perceived by vim.
FYI, xterm has a similar option, 'cjk-width'.  Like xterm,
my patch uses Markus Kuhn's EA width 'A' character table
automatically generated from Unicode 3.2. When Unicode 4.0
is finalized, the table has to be updated.
It'd be nice if the patch could get in soon.
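A simplified sketch (mine, not the actual patch) of the kind of width logic
described above; the table below lists only three sample ranges, whereas the
real table is generated from the Unicode EastAsianWidth data:

    #define _XOPEN_SOURCE 700
    #include <wchar.h>

    struct interval { wchar_t first, last; };

    /* A few East Asian Width 'A'(mbiguous) characters, for illustration only. */
    static const struct interval ambiguous_sample[] = {
        { 0x00A9, 0x00A9 },   /* COPYRIGHT SIGN  */
        { 0x00AE, 0x00AE },   /* REGISTERED SIGN */
        { 0x20AC, 0x20AC },   /* EURO SIGN       */
    };

    static int is_ambiguous(wchar_t wc)
    {
        size_t i;
        for (i = 0; i < sizeof ambiguous_sample / sizeof ambiguous_sample[0]; i++)
            if (wc >= ambiguous_sample[i].first && wc <= ambiguous_sample[i].last)
                return 1;
        return 0;
    }

    /* cjkw != 0 corresponds to the option being on; otherwise fall back
     * to the ordinary wcwidth() cell width. */
    int cell_width(wchar_t wc, int cjkw)
    {
        if (cjkw && is_ambiguous(wc))
            return 2;
        return wcwidth(wc);
    }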

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: alias in fontconfig (Re: supporting XIM)

2003-03-31 Thread Jungshik Shin
Tomohiro KUBOTA wrote:

- Xmms cannot display non-8bit languages (music titles and so on).

 

   Are you sure? It CAN display Chinese/Japanese/Korean ID3 v1 tags as long as
the codeset of the current locale is the codeset used in the ID3 v1 tag.
   

I'll test this further.  However, please note I won't be satisfied by
i18n which require specific configuration other than setting LANG
variable (and installing required softwares and resources).
 xmms does NOT need anything more than setting LANG. The reason I used
LC_ALL in my example is that that's the only sure way to set the locale. If I
use LANG, it can get shadowed by LC_ALL and LC_*; LC_ALL overrides
LC_* and LANG. Other complications are not the fault of xmms but of
the ID3 v1 tag format, which does not have any mechanism for specifying the
encoding.  ID3 v2 should solve this problem by using Unicode, but not many
programs support it. (I doubt many portable mp3 players support it.)

I want such alias to be automated.  If I have one Korean font installed,
it is obvious that renderer must use the font for all Korean texts.
It is not a good idea that the renderer fail to display Korean when
the user doesn't configure the alias.
   fontconfig always returns a font if there's a font on the system 
with the character requested.
So, it's possible now.

 

- There are no lightweight web browser like dillo which is i18n-ed.
 

I think that w3m-m17n is an excellent lightweight browser that 
supports I18N well.
   

Well, I meant a lightweight GUI browser.  Though I haven't checked,

 

  It's sort of a GUI browser. It supports image rendering and the mouse.  You
can also compile it with a JS interpreter. BTW, how about
Phoenix (www.mozilla.org/projects/phoenix) and Galeon?

There is another i18n extension of w3m: w3mmee.  I don't know which
is better.
 

 I'm aware of that. I just wish either of them (or a combination of
the two) were included in w3m.

- FreeType mode of XFree86 Xterm doesn't support doublewidth characters.

 

  Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect
he'll apply it sooner or later. After that, I'll add '-faw' option 
(similar to '-fw' option).
   

Fantastic!  May I want more?  Xterm can automatically search a good
(corresponding) doublewidth font in non-FreeType mode.  How about
your patch?
 

I'm not sure whether I can.  We'll see.



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: supporting XIM

2003-03-31 Thread Jungshik Shin
Jungshik Shin wrote:

Edward Cherlin wrote:

The starting point of this discussion was the inability to use 
Chinese, Korean, and Japanese IMEs in the same locale. I write 
documents in all three languages, and I would do it more often if it 
were actually convenient.

 This is becoming rather frustrating. How many times do I have to write
that it IS possible right now to install all of them and switch
between them in a *single* application (session) running under any
UTF-8 locale of your choice?   Why don't you try installing


 I'm sorry I somehow didn't realize (how couldn't I? I don't know...)
that you had written the above, probably because I had written that everything
you need for CJK input comes by default with modern Linux distros (which is not
true) and that you don't need a HOWTO.  Certainly, it's not well known that
it's possible to switch between multiple gtk2 input modules (including
those for CJK), and it'd be nice to have a well-written summary on the
issue with pointers to various gtk2 input modules. It also would be nice
for major Linux distributions to include them.

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: alias in fontconfig (Re: supporting XIM)

2003-03-31 Thread Jungshik Shin



On Mon, 31 Mar 2003, Edward Cherlin wrote:

 On Monday 31 March 2003 04:31 pm, Jungshik Shin wrote:
 Tomohiro KUBOTA wrote:
  I want such alias to be automated.  If I have one Korean
   font installed, it is obvious that renderer must use the
   font for all Korean texts. It is not a good idea that the
   renderer fail to display Korean when the user doesn't
   configure the alias.
 
  fontconfig always returns a font if there's a font on the
  system with the character requested.
  So, it's possible now.

 Doing it one character at a time is guaranteed to give hideous
 results. I have had the unfortunate experience of viewing a
 display in mixed CJK fonts, and I have had many similar

    Well, it depends on what kinds of fonts you have on your
system and the way you specify the fonts you want to use. I'm well aware
of 'ransom note'-like results when you mix up fonts of many *different*
styles and design principles in a single run of text.  This problem can
be minimized if you are careful in putting together fonts of similar
styles and design principles.

    Anyway, if someone finds it difficult to edit the fonts.conf
file and doesn't want to install a minimal set of well-populated
fonts (sans, serif, monospace, etc.), but still wants
as many characters as possible to be rendered, a ransom note
is what she deserves to get.


 unfortunate experiences of viewing APL code rendered in random
 math fonts. It is extremely important to a lot of people that
 they be able to specify a font *per language*, without regard to

  Well, *per-language* is not a cure-all, although
on many occasions it's sufficient.


 the definition of Unicode blocks or old-time code pages or
 ISO-8859-* or any other 8-bit font hack. But we want to do it

  We don't live in that world any more, largely thanks to
fontconfig, Xft and Pango.  The age of X11 core fonts
and the XLFD hack is gone for good.

 There is, of course, the question of defining the character
 repertoire and rendering rules for a language (which may differ
 substantially from the rules for another language written in the
 same script). To get started, it will suffice if I can say that
 the set of characters in one font that I designate defines the
 repertoire for my use of the language. When we have adequate
 support for more intelligent fonts, we can build in some of the
 rendering rules, also, but in the end language-specific document
 creation will be the job of applications well above the text

   In the case of HTML, 'lang' does the job and Mozilla supports
it pretty well. Unfortunately, 'xml:lang' is not yet supported.


 editor level. At some point, explicit repertoire lists will be
 needed, I suppose. Or something else we haven't thought of yet.

   Care to take a look at http://fontconfig.org ?
It includes language-dependent repertoire lists for most, if not all,
of the languages listed in ISO 639 (or is it ISO 30xx?).


   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



fontconfig, alias/pseudo-fonts, Xft (was...Re: supporting XIM)

2003-03-31 Thread Jungshik Shin
Mike FABIAN wrote:

Pablo Saratxaga [EMAIL PROTECTED] wrote:


Also, Xft allows to define "virtual fonts" created from a list of other
fonts; "Sans", "Serif" and "Monospace" come in standard.


~/.fonts.conf


I guess Pablo meant something like the following,
but this doesn't work the way he (and
I) wrote it would if only Xft APIs are used (see below). For instance,
'monospace' is a 'virtual' font defined as

<alias>
  <family>monospace</family>
  <prefer>
    <family>Luxi Mono</family>
    <family>Nimbus Mono L</family>
    <family>Kochi Gothic</family>
    <family>ZYSong18030</family>
    <family>AR PL SungtiL GB</family>
    <family>AR PL Mingti2L Big5</family>
    <family>Gulimche</family>
    <family>Andale Mono</family>
    <family>Courier New</family>
  </prefer>
</alias>


and define some pseudo-fonts you want.


How does that work? I didn't know that it is possible to define
"virtual fonts" from a list of other fonts using fontconfig/Xft2.


But I don't yet know a *simple* way to achieve that by using only Xft2.
When using something like

   xft_font = XftFontOpenPattern(dpy, pattern);

I guess you have to call fontconfig APIs (e.g. FcFontSort) directly
and do a manual break-up of your input text into multiple pieces,
to be rendered by one of the fonts returned (by FcFontSort) depending
on their coverage. And, you know this *complex* way, don't you?
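(A rough sketch, mine and not from the original mail, of that 'complex way'
in C: get the sorted candidate list with FcFontSort() and pick, per character,
the first font whose charset covers it. The function name pick_fonts_for_text
is made up, and run-splitting/error handling are only hinted at.)

    #include <fontconfig/fontconfig.h>
    #include <stdio.h>

    static void pick_fonts_for_text(const FcChar32 *text, int len)
    {
        FcResult result;
        FcPattern *pat;
        FcFontSet *set;
        int i, j;

        FcInit();
        pat = FcNameParse((const FcChar8 *)"monospace");
        FcConfigSubstitute(NULL, pat, FcMatchPattern);
        FcDefaultSubstitute(pat);

        /* Sorted list of all fonts that could satisfy the pattern. */
        set = FcFontSort(NULL, pat, FcTrue, NULL, &result);

        for (i = 0; i < len; i++) {
            for (j = 0; set && j < set->nfont; j++) {
                FcCharSet *cs = NULL;
                FcChar8 *family = NULL;
                if (FcPatternGetCharSet(set->fonts[j], FC_CHARSET, 0, &cs) == FcResultMatch
                    && FcCharSetHasChar(cs, text[i])) {
                    FcPatternGetString(set->fonts[j], FC_FAMILY, 0, &family);
                    printf("U+%04X -> %s\n", (unsigned int)text[i],
                           family ? (const char *)family : "?");
                    break;   /* in a real renderer: extend the current run */
                }
            }
        }

        if (set)
            FcFontSetDestroy(set);
        FcPatternDestroy(pat);
    }

    int main(void)
    {
        const FcChar32 sample[] = { 'A', 0xAC00, 0x65E5 };  /* Latin, Hangul, CJK */
        pick_fonts_for_text(sample, 3);
        return 0;
    }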

I always got exactly one font. Are you saying that it is possible to use
more than one font with a single call to XftFontOpenPattern()
by doing some setup in ~/.fonts.conf?


I think Pablo mistook what fontconfig does for what Xft does, unless
I'm missing something Pablo knows. I also plead guilty of making
a similar mistake when I wrote about working around a hard-coded
font name in a window manager and a theme (e.g. Courier).

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/

diacritic marks for Latin alphabet (Re: supporting XIM)

2003-03-31 Thread Jungshik Shin
Edward Cherlin wrote:

On Monday 31 March 2003 06:38 am, Gaspar Sinai wrote:
  

On Sun, 30 Mar 2003, Edward Cherlin wrote:


Let's try some more.
aeiounx
Not too bad, except that only the first three accents on
each letter are actually displayed, and the dot on the i
isn't removed. 

  Hmm, I can see only two diacritics in Kwrite with Code2000 font.
I found that you appended as many as five of them to each character
in your sample.  What font did you use? Nonetheless, it's a pleasant
surprise that Kwrite does more than simple overstriking.


What do you see in your mail?
  

Yudit currently supports Mark-To-Base and Mark-To-Mark
(2.7.5.beta10) OpenType GPOS and it uses GSUB only for Indic
scripts, ligatures and shaping. Resonable Tibetan (almost
ready) also needs all of these complexities.

If there is an urgent need for this in other scripts I can
take a look at it. 



Not in Latin-alphabet text generally. Writing systems that have 
such needs include Vietnamese, IPA, Math, Polytonic Greek, 
  

  Does Vietnamese need diacritic marks? Sure, it does, but
I think all the letters it needs are encoded as precomposed characters, so
they don't need any special treatment other than conversion between
NFC and NFD.

 
Indic and South Asian are much higher priority than multiply 
accented Latin for mathematicians.
  

   That's why Indic scripts are rather well supported in Yudit now :-)


Is it possible to define all the combinations in GPOS and GSUB
tables in the font at all?



It seems like this is where AAT fonts with state machines are superior to
opentype fonts.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-31 Thread Jungshik Shin
Edward Cherlin wrote:

On Sunday 30 March 2003 11:25 pm, Jungshik Shin wrote:
 

I'm also gonna explore
if it's easier to wed 'pango' with Mozilla  if  gtk2  instead
of gtk is used. That would dramatically improve complex script
handling of Mozilla if possible.
   

Have you looked at SILA? It uses SIL Graphite as the renderer for 
Mozilla.

http://sila.mozdev.org/
 

Yup. I'm aware of it.  At least for now it's only for Windows, though.
However, we may get some valuable insights from the project that can be
applicable to a 'Mozilla-Pango' marriage.
Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: supporting XIM

2003-03-30 Thread Jungshik Shin
On Sat, 29 Mar 2003, Edward Cherlin wrote:

 aplications explicitly at present, and automatic support for
 Cyrillic, Greek, Armenian, or Hindi doesn't help Japanese users
 much.

    Automatic support for Hindi? Hmm, do I live in a world
different from yours?  It's NOT CJ(K) BUT Hindi, Tibetan, Arabic, Hebrew,
Bengali, pre-1933 Korean, Polytonic Greek (and Latin/Cyrillic with diacritic
marks for which combining characters are necessary) and other complex
scripts that have the largest wish list. Pango has support for some
Indic scripts and the Thai script, but it doesn't yet support layout of
Greek/Cyrillic/Latin with opentype layout tables.



 out a way to funnel IME input through the normal character input
 calls, we might well achieve CJK support in the majority of
 apps.

  Well right now, the majority of programs in modern Linux
distros DO  work well with CJK IMEs. In case of gtk2 applications,
they also work well with any gtk2 input modules including
those for CJK.  Of course, this doesn't mean that there's
very little to  do when it comes to CJ(K) support, but
I don't share Kubota-san's concern.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: supporting XIM

2003-03-30 Thread Jungshik Shin
Tomohiro KUBOTA wrote:

- a word processor whose menus and messages are translated into your
  native language but cannot input/display text in your native language
- a word processor whose menus and messages are in English but can
  input/display/print text in your native language
Which is better?  The first one is completely unusable and the second
one is unconveinent but usable.
 

I agree with you on this point. That's why I compared the status of KDE
in 1999-2000 with that in 2003. Back in 1999-2000, KDE/Qt people thought that
translating messages is I18N, but they don't think so any more, and KDE/Qt
supports 'genuine I18N' much better now.

Now brief list of examples.

- Xmms cannot display non-8bit languages (music titles and so on).

   Are you sure? It CAN display Chinese/Japanese/Korean ID3 v1 tags as long as
the codeset of the current locale is the codeset used in the ID3 v1 tag.
The problem with mp3 and the ID3 v1 tag is that the ID3 v1 tag doesn't have
any means of labelling the codeset used in the tag. Therefore, you can't
view Russian ID3 v1 tags (in KOI8-R) and Korean
ID3 v1 tags (in EUC-KR) in a *single* xmms session.  To work around this,
there are three ways (we discussed this issue a couple of months ago
on this list):

 1. convert all ID3 v1 tags in your mp3 collection to UTF-8
 2. Give up the idea and launch two separate xmms instances under two
    different locales:
 % LC_ALL=ru_RU  xmms &
 % LC_ALL=ko_KR xmms &

- Xft/Xft2-based softwares cannot display Japanese and Korean at the
  same time while Xft and Xft2 are UTF-8-based, because there are no
  fonts which contain both of Japanese and Korean.  This should not
  be regarded as a font-side problem, because (1) font-style principle
  is different among scripts (there are no courier font for Japanese)
You can use 'alias' in fontconfig if some programs use 'Courier'
or 'Arial' instead of generic font names like 'monospace', 'serif',
'sans-serif', and so forth.

  and (2) such fonts need developers who can design letters all over
  the world.  Pango's approach (changing font according to script)
  is needed.  

 Well,  if Xft2 is used along with fontconfig, there's no such problem. 




- There are many window managers which support themes.  Even if the
  window manager itself is already i18n-ed, some themes cannot display
  non-Latin-1 languages.  This occurs in two cases: (1) when the theme
  specifies a font name (it is very likely) or (2) when the theme
  supplies an origial font.
 In the first case, you can work around the problem rather easily with the
'alias' mechanism in fontconfig.

 

- There are no lightweight web browser like dillo which is i18n-ed.

I think that w3m-m17n is an excellent lightweight browser that 
supports I18N well.

- FreeType mode of XFree86 Xterm doesn't support doublewidth characters.

  Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect
he'll apply it sooner or later. After that, I'll add '-faw' option 
(similar to '-fw' option).
  

- Ghostscript.  It is known that it can handle Japanese by some
  trick (by localized version?) but it is too complex and difficult
  for me.
 It's not that hard. Most changes made by the gs-cjk project have been
folded back into the upstream gs.  Moreover, modern Linux distros now come
with ghostscript with all the 'hard' jobs (configuration) already done for
you, so you don't have much to do.

- Even OpenOffice.org 1.0 cannot display Japanese even with Japanese
  add-on package.  I have to configure some font substitution.  Note
  that this can be done only after installation, thus I cannot read
  (translated) messages during installation at all.
 OpenOffice seems to have a serious problem when run under a UTF-8
locale. Under locales with legacy codesets it more or less works, but the
Unix/X11 version appears to need an overhaul to use the new client-side font
framework (fontconfig, Xft, pango). Its use of the old server-side font
technology makes it slow and ugly.



- Curses-basd softwares.  They must not assume number of bytes is
  same as number of columns or number of characters.  Doublewidth
  and combining character support is needed.
  As I mentioned already, this is where we need a lot of work.
There are a few programs that work well, though, when linked against
ncursesw.  One prominent example is mutt.

 

- Perl doesn't have wcwidth().

  Well, there are a couple of Perl packages that let you  query various 
Unicode character
properties so that it should be trivial to write your own wcwidth() if 
somebody
hasn't done it already.

- Text line wrapping.  Chinese and Japanese (not Korean) don't use
  whitespace between words.
 

 I already mentioned this issue. Programs like 'fmt' have to be
modified, but there's already an alternative to 'fmt' that supports the
Unicode line breaking algorithm.

I feel that CJK people everytime have to keep a watch on softwares
which are already i18n-ed, because i18n support of such softwares

Re: supporting XIM

2003-03-30 Thread Jungshik Shin
Tomohiro KUBOTA wrote:

Perhaps not double-width, but there are plenty of non-ASCII,
non-ISO-8859-1 characters in the Unicode set that should be
interesting to U.S. programmers.
   

This is a good information.  I hope such people will hard-code
UTF-8 support up to two bytes.  Though I didn't find such softwares,
I heard there are such softwares.  We have to continue keeping watch
on i18n implement of softwares
How about em-dash or ligatures such as fi or ffl?  Are they
doublewidth?
 

Em-dash is a valid example, but 'fi/ffl' are NOT. Ligatures should not
be 'hardcoded' by those who edit documents, but have to be automatically
'summoned' at the rendering layer. Anyway, other examples include the Euro
sign, genuine opening quotation marks and many more that have been mentioned
several times by Markus Kuhn on this list before.



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: I18nized apps (was Re: supporting XIM)

2003-03-30 Thread Jungshik Shin
Edward Cherlin wrote:

Nadine Kano wrote one, published by Microsoft, which is 
unfortunately very much out of date and out of print. I know of 

Well, the book is not just outdated but has some critical errors/mistakes
and Microsoft-centrism (which doesn't work well for POSIX systems)
along with useful information. BTW, I believe MS Press recently released an
update to the book.

Perhaps some of us should get together and pitch the idea to 
O'Reilly. Certainly a HOWTO is in order.
 

 Although it's not exactly the kind you're looking for, CJKV 
Information Processing
would be a useful reference for I18N engineers.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: Pango tutorial? (Re: supporting XIM)

2003-03-30 Thread Jungshik Shin
Tomohiro KUBOTA wrote:

Unfortunately, there are no tutorials for Pango.  A developer of Xplanet
and I sent mails to a Pango developers (Evan Martin and Noah Levitt) to
ask that but they think Pango is not intended to be used from applications
 

   Owen Taylor is 'the' Pango developer, isn't he?



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: supporting XIM

2003-03-30 Thread Jungshik Shin
Glenn Maynard wrote:

programmers in X care more about X support than Windows
support (which is very annoying to Windows users, who often end up with
 

old, buggy ports of X software when they get them at all).

Off-topic: this is one of many reasons the scientific community
(astronomy/astrophysics, for instance) was one of the earliest groups to
embrace Linux. Their main toolsets are all written for X11 and their
Windows/MacOS ports were buggy and outdated, but porting them to Linux is
a lot easier.

This is actually one advantage of NFD: it makes combining support much
more important.  (At least, it's an advantage from this perspective;
those who would have to implement combining who wouldn't otherwise
probably wouldn't see it that way.)
 

  Another advantage of NFD is consistency.  In NFC, some characters
with diacritic marks are represented as precomposed while others are
represented as a base character + diacritics. In NFD, all characters are
represented the same way, except for some Korean Hangul Jamos due to 'the'
very stupid mistake of the South Korean standards body, which requested the
removal of the decomposition of cluster Jamos into sequences of simple/basic
Jamos. (Overall, Korean script handling in Unicode/10646 is among the worst.)

By the way, I just gave lv a try: apt-get installed it, used it on a
UTF-8 textfile containing Japanese, and I'm seeing garbage.  It looks
like it's stripping off the high bits of each byte and printing it as
ASCII.  I had to play around with switches to get it to display; apparently
it ignores the locale.   Very poor.  Less, on the other hand, displays
it without having to play games.  It has some problems with double-width
characters, unfortunately.
 

  Actually, with Owen Taylor's patch posted here about a year and a half
ago(?), 'less' works pretty well with UTF-8 under a UTF-8 xterm.

Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-30 Thread Jungshik Shin
Evan Martin wrote:

(Following the earlier discussion about XIM...)

http://im-ja.sourceforge.net/
is a pretty effective input module for Japanese input in GTK2.
 

And, you can install alongside it:

 http://sourceforge.net/projects/wenju/  (includes gtk2 input module(s)
 for Chinese: table-based)
 http://kldp.net/projects/imhangul   (Korean gtk2 input module suite)

and other gtk2 input modules for other scripts. You can also switch around
various Xkb-supported key layouts, as you and others wrote, with the help
of the KDE keyboard switcher or the Gnome2 keyboard switcher. Besides, if you
want, you can still use one of the XIM servers you like. I'd rather use the
built-in XIM server (Compose for a UTF-8 locale) by resetting the XMODIFIERS
env. variable (or its equivalents in X resources).

As far as input methods are concerned,
this thread is almost a replica of the thread last December, and all this
information was given then (except for the KDE/Gnome2
Xkb kbd switcher, and im-ja in whose place a less advanced gtk2 input module
for Japanese was mentioned by Owen).  Is there anything wrong with the
collective memory of this list? ;-)

Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-30 Thread Jungshik Shin
srintuar26 wrote:

As far as input methods are concerned,
this thread is almost a replica of the thread last December, and all
this information was given then (except for the KDE/Gnome2
Xkb kbd switcher, and im-ja in whose place a less advanced gtk2 input module
for Japanese was mentioned by Owen).  Is there anything wrong with the
collective memory of this list? ;-)
   

Well, I for one have been placated for now by im-ja. It's precisely
what I've been looking for, and extensive googling didn't root it out.

 im-ja may not have turned up in Google, but the archive of this list includes
all the necessary information we went over again last week,
except for the KDE/Gnome2 kbd switcher. Actually, I'm not sure
of my own memory and that may also have been mentioned in
the past.



XIM has been a disappointment for me, and I got tired of using iconv,
rom2hira scripts, a trivial console based canna interface, and
kanjipad for my input needs. (rh8 uses euc-jp for its Japanese
locale, and I refuse to use non-utf-8 locales, but XIM wont work
correctly or stably outside of the euc-jp locale...)
 

Well, you must not have been on this list long enough. Last Nov/December,
I posted how to make RH8 support ja_JP.UTF-8 and ko_KR.UTF-8.
Most of my changes have been fed back to XFree86 and are included
in XFree86 4.3. Hopefully, RedHat 9.0 will turn on UTF-8 locales for
CJK by default, as I urged them to do on several occasions.
BTW, I've been using ko_KR.UTF-8 for about a year now.
Now if only more apps were gtk2 based...
Mozilla and gvim come to mind.
 

 The gtk2 patch for vim works very well. Just search for 'vim gtk2 patch' and
you'll get http://regexxer.sourceforge.net/vim.  If you're adventurous,
you can try building the gtk2 port of Mozilla yourself. It's being worked on.
I'm gonna give it a shot myself soonish.  I'm also gonna explore
whether it's easier to wed Pango with Mozilla if gtk2 instead of gtk
is used. That would dramatically improve complex script handling
in Mozilla if possible.
  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: supporting XIM

2003-03-30 Thread Jungshik Shin
Edward Cherlin wrote:

On Sunday 30 March 2003 06:29 pm, Jungshik Shin wrote:
  

Edward Cherlin wrote:


On Sunday 30 March 2003 03:26 am, Jungshik Shin wrote:

  

I
can't test some of the others myself, and haven't heard any
detailed information on them. I have not found any problems
with diacritics in Latin and Cyrillic.
  

  Well, you do have problems with characters with diacritics
in Latin, Greek and Cyrillic for which
Unicode has NOT assigned and will NEVER assign separate
codepoints. That's
what I was talking about. There are tens, if not hundreds,



thousands, if not tens of thousands. I'm a mathematician.
  


  I know how to multiply, too. It doesn't take a mathematician
to multiply, does it?  :-) The reason I wrote tens/ hundreds
instead of thousands/tens of thousands was that I like to
give the number of combinations that have turned up
in existing documents rather than the number of
all possible combinations.

  

of combinations
(base character + one or more diacritic mark(s)) that can ONLY
be represented by combining character sequences. 



Like this? 
a
It's an a with two accents, and it composes and displays 
correctly in kwrite and kmail, with one accent above the other.

Let's try some more.
aeiounx
Not too bad, except that only the first three accents on each 
letter are actually displayed, and the dot on the i isn't 
removed. Curiously, Yudit doesn't handle multiple accents as 
well as these simple-minded apps do.


 Yudit needs the same change as I proposed for Pango in this mail
and a couple of others. Yudit supports opentype layout tables
for several Indic scripts, and it needs to do the same for the
Latin/Greek/Cyrillic alphabets. SIL has one such font.
Unfortunately, the last time I downloaded it, there was something
wrong with the zip file and I couldn't try it.
(http://www.sil.org/~gaultney/gentium/index.html)


What do you see in your mail?
  

  I can't tell without knowing what I'm supposed to see.
Anyway, what I see is two diacritics overlapping
each other instead of taking disjoint 'spaces' alongside
or on top of/below each other.  See
http://www.columbia.edu/kermit/st-erkenwald.html
for a real-life example.

  Didn't I specifically write that Pango does not support
diacritic marks combined with base characters while Uniscribe
does (although it didn't until very recently)? I know
that xterm and vim support up to two combining characters
and that's how the pre-1933 Korean script and Latin/Greek/Cyrillic
diacritic marks are supported by xterm/vim. I guess kmail/kwrite
do likewise. However, that's a kind of last resort when you
don't have a better way to do it properly.  Eventually, what
we need is support in Pango, and that's filed as
bug 101079 (see http://bugzilla.gnome.org/show_bug.cgi?id=101079).

Other pango bugs I filed (excluding Korean-specific ones)
include :

http://bugzilla.gnome.org/show_bug.cgi?id=101081
http://bugzilla.gnome.org/show_bug.cgi?id=106624

The starting point of this discussion was the inability to use 
Chinese, Korean, and Japanese IMEs in the same locale. I write 
documents in all three languages, and I would do it more often 
if it were actually convenient.


  This is becoming rather frustrating. How many times do I have to write
that it IS possible right now to install all of them and switch
between them in a *single* application (session) running under any
UTF-8 locale of your choice?   Why don't you try installing
all three of them (im-ja, imhangul and wenju) and firing up
gedit and right-clicking on the text input area to see what you have?
The very same information was given last December, and
this thread doesn't add any new information except for
im-ja in place of other, less advanced Japanese gtk2 input modules.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: supporting XIM

2003-03-29 Thread Jungshik Shin
Tomohiro KUBOTA wrote:

Hi,

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Thu, 27 Mar 2003 18:38:51 -0500 (EST)
 

 That's not a problem at all because there are Korean, Japanese
and Chinese input modules that can coexist with other input
modules and be switched to and from each other. With them, you
don't need to use XIM.
   

...

One point: Many Japanese texts include Alphabets, so Japanese people
want to input not only Hiragana, Katakana, Kanji, and Numerics but
also Alphabets.  I imagine Korean people want, too.  In such a case,
switching between Alphabet (no conversion mode) and conversion mode
has to be achieved by simple key typing like Shift + Space.  

 There are two kinds of switching involved here. One is intra-module mode/level
switching and the other is inter-module switching.
What you want for Japanese (and, as you correctly guessed, Koreans also need)
can be easily achieved by the intra-module mode switching of a single gtk2
input module.
For instance, all 5 modules included in the imhangul Korean gtk2 input module
suite interpret 'shift-space' as the toggle switch between Korean and English
input modes and 'F9' for Hangul-to-Hanja conversion. I don't see any reason
the same cannot be done for Japanese gtk2 input modules.  I believe
there's nothing in the gtk2 input module framework that prevents a
single input module from supporting multiple 'modes' (or levels) that
can be switched around as necessary.

As for inter-module switching, I guess some more work is necessary.
It seems like the only way to switch to another input module is through
the pop-up menu that can be 'summoned' by right-clicking. However,
combined with the KDE keyboard switcher (I got to know that gnome2
has a similar utility), which appears to be a simple wrapper over
setxkbmap, you don't have to right-click very often, I believe.
Another point: I want to purge all non-internationalized softwares.
Today, internationalization (such as Japanese character support) is
regarded as a special feature.  



However, I think that non-supporting
of internationalization should be regarded as a bug which is as severe
 

 I agree, and I think most, if not all, people on this list agree, too.
Thanks to a lot of smart people from all over the world, including a lot of
contributors like you from Japan, the free/open source community has taken
several, if not a lot more, huge steps forward in terms of I18N during
the last few years. Back in 1998, when I read Drepper's paper
on I18N in glibc, the problem appeared to be overwhelming. As late
as 1999/2000, the KDE team mixed up L10N and I18N and claimed that
KDE 1 supported CJK while all it actually had was translated messages
in CJK.  Now look what we have: gtk2/gnome 2/pango, KDE3/qt, glibc2,
XFree86, Xft/fontconfig, freetype, the _NET_WM extension, ICU,  Perl 5.8,
xterm/mlterm, vim, yudit,  Omega/Lambda, and many others I forgot to mention.

means users have freedom to choose.  Such a freedom of choice must not
be a priviledge of English-speaking (or European-languages-speaking)
people.  Do you have any idea to solve this problem?
 

No question about that. What do we have to do? Well, just as we have
done so far, I think we have to keep working as well and as hard as we
can.  I think I18N awareness and an I18N mindset are now widespread
among developers worldwide, and I'm not as worried about
CJ(K) as you are. However, we still have a long way to go
to (fully) support the complex scripts of South Asia, Southeast Asia,
Southwest Asia (the Middle East), Korea (Hangul is a complex
script) and Europe/Africa/North America (yes, Europe!
Latin/Greek/Cyrillic alphabets are complex, too!!)
Of course several Japanese companies are competing in Input Method
area on Windows.  These companies are researching for better input
methods -- larger and better-tuned dictionaries with newly coined
words and phrases, better grammatical and semantic analyzers,
and so on so on.  I imagine this area is one of areas where Open
Source people cannot compete with commercial softwares by full-time
developer teams.
  As some linguists have observed, the Japanese writing system seems to offer
a number of fascinating opportunities for linguists/computer programmers to
put their mature and immature ideas to the test.

How about Korean?
 

 In the case of Korean, conversion to Hanja (Chinese characters) is not as
important an issue as it is in Japan. Simple dictionary-based word and
character look-up appears to be sufficient for most Korean users because they
rarely use Hanja. As for Hangul input (putting aside pre-1933 orthography
Korean for the moment), there are two major keyboard layouts
(like qwerty vs dvorak) with a few variants, but the situation has been
stable for more than a decade.  In other words, there doesn't seem to be
much room for innovation, because Korean input is not much more complex than
input of Latin/Greek/Cyrillic alphabet-based scripts.

 Cheers,

 Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive

Re: supporting XIM

2003-03-29 Thread Jungshik Shin
On Sat, 29 Mar 2003, Pablo Saratxaga wrote:

 On Sun, Mar 30, 2003 at 12:37:49AM +0900, Tomohiro KUBOTA wrote:

  However, I am often annoyed by people who think supporting European
  languages is more important than supporting Asian languages

   I don't think you meant it that way, but I find it very annoying
that some people and software use 'Asia' to mean only CJK.
One prominent example is Sun's StarOffice and OpenOffice.
That's almost an insult to the people of the Indian subcontinent, Southeast
Asia, Central Asia, and Southwest Asia.

 Are there such people?

   There might be some, but as I wrote in my response to Kubota-san,
the I18N mindset is much more widespread than 5 years ago, and
I agree with your assessment of I18N in Linux below.

 Note also that, currently, I do'nt agree with you that i18n of programs
 is low; to the contrary, the majority of programs have good to
 very good i18n support.



  How should I call such people?  I know they are never racists in its
  original meaning.

 ethno-centrist is the word you are looking for I suppose.

  If they're from Western Europe, 'Western-Eurocentric' :-)


 Tell me about one single current major program/project that doesn't have
 i18n support (maybe there are, and I'm just not aware of it (probably because
 a modern software without i18n support is not worth it in my eyes).

   One example is mkisofs in cdrtools. It's 'single-byte-centric'
and the project maintainer has yet to accept a patch for multibyte support
(including UTF-8). Sooner or later, I'll send him a new patch in such
a form that he finds it hard to leave aside.

   Other examples include fmt and other textutils, mc (it sorta works,
but needs a lot of work to be fully I18Nized and UTF-8 friendly), lynx
(one MIME charset at a time is well supported, but it needs multilingual
ability as found in w3m-m17n; I hope major Linux distros include w3m-m17n
instead of plain w3m) and Pine (it works fine for a single MIME charset,
but it's not yet multilingual and its screen handling is single-byte-centric;
my UTF-8 patch solves only a small subset of these problems). 'less'
still needs more work (Owen's patch is better than my patch
that went into less 37x.)

    Some terminal emulators and terminal-based/-like programs need
to pay more attention to East Asian Width (UTR #11). xterm has an
option '-cjk-width', and other programs need a similar option/feature.
Vim needs this. Its current column width calculation routine is not based
on wcwidth(). (I plan to fix this soon.  It's very easy, and Markus's
wcwidth and wcwidth_cjk come in very handy. It's better to use them than
wcwidth from glibc, which is locale-dependent.) The gtk2 font selection
widget should optionally offer a way to designate a *separate*
'monospace' font for 'double width'. So should Qt's font selection widget.
It's naive to believe that fontconfig and pango can do the magic for
this case, as evidenced by the fact that MS Word under MS Windows,
even with equivalents of fontconfig and pango, lets
users select an East Asian font separately.


    Full-screen text-based programs need to be linked against
ncursesw rather than ncurses or slang (how good is slang's
UTF-8 and multibyte support?) and delegate as many screen-manipulating
tasks to ncursesw as possible.  When used with mutt, ncursesw
appears to work well under a UTF-8 locale.

   Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: supporting XIM

2003-03-27 Thread Jungshik Shin



On Wed, 26 Mar 2003, Edward Cherlin wrote:
 KDE has a decent keyboard and IME switcher in the KDE Control
 Module. You can install it on the toolbar and choose your hot
 key combinations from a drop-down menu.

  Thanks for the info. I didn't know KDE had this feature. However,
does it work for switching XIMs as well? It lets me switch among
as many keyboard layouts as I want, but it doesn't look like
it supports switching between XIMs. Hmm, is it time to upgrade my
KDE?

  Anyway, I found gtk2 input module switching very nice and hope many more
gtk2 input modules come standard with popular Linux distros.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: supporting XIM

2003-03-27 Thread Jungshik Shin



On Thu, 27 Mar 2003, Pablo Saratxaga wrote:

 [I Cc: to gnome-i18n as it concerns mainly the gtk2 input]

 On Thu, Mar 27, 2003 at 04:17:58AM -0500, Jungshik Shin wrote:

As mentioned before, this is possible in GTK2 applications.
  Fire up gnome-terminal and right-click in any text input area
  and you'll get a pop-up menu from which you can choose a gtk2
  input module a la Windows.

 But you are limited to only one X input method...
 That is the big problem; it would be much better if it were possible
 to have *several* X input methods, like in yudit.

  That's not a problem at all, because there are Korean, Japanese
and Chinese input modules that can coexist with other input
modules and be switched to and from each other. With them, you
don't need to use XIM.  For instance, the imhangul gtk2 input module for
Korean (http://kldp.net/projects/imhangul) is much more powerful than
Ami. I haven't tried Japanese or Chinese gtk2 input modules, but judging
from the way imhangul works, it should be possible to write Japanese and
Chinese input modules as powerful as, if not more powerful than, Japanese
and Chinese XIM servers. BTW, this also works *along* with Xkb. So, if
you have the KDE 'keyboard switcher' (which appears to be a simple wrapper
over setxkbmap, whose function can be performed by setxkbmap itself in a
non-KDE environment), you can switch between all gtk2 input modules, XIM
(either Compose or one of the XIM servers) and as many Xkb layouts as you want.


 me (I can only type some accented letters, while with a UTF-8 locale
 and an xkb keyboard (through the X input method) I can type much more.

   You meant 'Compose'(the built-in XIM server) by 'xkb keyboard',
didn't you?


 I never use the built-in input of gtk2, as it is too deficient for
 In particular esperanto accented letters, azeri schwa, and others.

    You can just use Xkb for whatever is easier to type with Xkb than
with gtk2 input modules. You wrote as if there's an inherent limit in
gtk2 input modules, but obviously there isn't.  It only depends on how
well any given module is written and designed.


 But then, I cannot type in japanese...

   There is at least one Japanese gtk2 input module, as I wrote above.
You just have to install it, because it doesn't come by default with
gnome 2.x.

 Well, I don't always use all of them, as I don't speak all those languages;
 but a lot of people may have needs that cover several input methods,
 for example Korean and Japanese, or Japanese and French (something
 almost impossible to do properly right now, if you have Japanese input
 you lost some accents), or Chinese and accented pinyin...

  With gtk2 input modules, you can have all of them.


 gtk2 input methods for translitering cyrillic or other scripts are
 useful, but not required.
 more useful are the methods to type in transliteration for scripts
 that use sillabaries with a wide range of combination (korean, geez,
 inuit-cree, etc.),

   Well, Korean script is not usually classified as a syllabary
although it could be many different things depending on how you look at
it :-). Anyway, if there's a need for them (transliterating input methods
for Ethiopic, Inuit, Korean, etc), somebody has to write input modules
for them.  Perhaps, taking advantage of what's done in yudit would be
a good idea when writing such an input module.


 But there is still missing the ability to use various XIM input methods
 and switch between them.

  It'd be nice to have that feature, but it's not necessary because
scripts that usually require XIM servers can be and are
supported by gtk2 input modules.

   Jungshik Shin



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: Perl script to hunt for malformed/overlong UTF-8 sequences

2003-03-18 Thread Jungshik Shin
Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for reviewing files with
unknown encoding manually.
 

 It may be a good idea to filter out the 'UTF-8' representation of
surrogate codepoints (0xd800 - 0xdfff) as well. That is, the following
can be added to $utf8malformed

  \xed[\xa0-\xbf][\x80-\xbf]

Jungshik





--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: Perl script to hunt for malformed/overlong UTF-8 sequences

2003-03-18 Thread Jungshik Shin
Jungshik Shin wrote:

Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for reviewing files with
unknown encoding manually.
 


 It may be a good idea to filter out the 'UTF-8' representation of
surrogate codepoints (0xd800 - 0xdfff) as well. That is, the following
can be added to $utf8malformed

  \xed[\xa0-\xbf][\x80-\xbf]
In addition, non-characters (0xffff and 0xfffe in all planes) may as
well be filtered out:

 \xef\xbf[\xbe-\xbf]|
 [\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]
(and 5- and 6-byte ones if you want)
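
For illustration only, here is a minimal sketch (my own code, not part of
Markus's script; the variable names are made up) of how those extra
alternatives could be used to flag suspicious lines:

  # sketch: scan stdin for UTF-8-encoded surrogates and non-characters
  my $surrogates    = qr/\xed[\xa0-\xbf][\x80-\xbf]/;
  my $noncharacters = qr/\xef\xbf[\xbe\xbf]|[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe\xbf]/;
  while (my $line = <>) {
      print "$.:$line" if $line =~ /$surrogates|$noncharacters/;
  }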





--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: UTF-8 and LaTeX

2003-03-11 Thread Jungshik Shin
Markus Kuhn wrote:

Frank Mittelbach ([EMAIL PROTECTED]) has posted on
2003-01-07 on [EMAIL PROTECTED] the beginnings of a far more
lightweight UTF-8 support for LaTeX within the inputenc framework, which
will hopefully find its way into the next release:
 http://www.latex-project.org/cgi-bin/ltxbugs2html?pr=latex%2F3480

 I'm not sure how far LaTeX can get stretched to support Unicode. It
appears that Lambda, based on Omega (http://omega.cse.unsw.edu.au:8080),
is one of the better ways, if not the way, along with truetype/opentype
fonts and dvi drivers like dvipdfmx (http://project.ktug.or.kr/dvipdfmx),
to get Unicode fully supported.

Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: UTF-8 Editors? (Was XML and tags)

2003-02-22 Thread Jungshik Shin

On Sat, 22 Feb 2003, Roozbeh Pournader wrote:

 On Sat, 22 Feb 2003, Edward H Trager wrote:

  It turns out that the version of vim that I have does indeed work under
  xterm for an assortment of LTR languages (Indian languages not tested),

  It wouldn't work for Indic scripts because xterm does not support
Indic scripts (although it supports Thai). It's not even clear what
VT100/220 terminal emulators should do for them.

  but not Arabic (the only RTL language tested)

 Arabic is not in vim yet. They are putting it in now that we're talking,
 and there have been a lot of discussions on something called 'cream' that
 is a vim distribution that has included the Arabic patch.

  You meant a standalone-gui vim (e.g. gvim) as opposed to vim running
inside a terminal emulator, didn't you? Without RTL
scripts supported by the term. emulator it's running under, I presume
that it'd be very hard to support Arabic in vim.  BTW, there's a port of
gui-based vim to gtk2 (and pango) which reportedly supports RTL scripts.
See http://www.opensky.ca/gnome-vim/todo.html. The latest patch
is not the one linked there but you should get it at
http://regexxer.sourceforge.net/vim.

   Jungshik Shin


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



mutt and ncursesw

2003-02-18 Thread Jungshik Shin



On Tue, 18 Feb 2003, Nikolai Prokoschenko wrote:

 On Tue, Feb 18, 2003 at 03:57:30AM -0500, Glenn Maynard wrote:
   mutt from Debian doesn't have any problems at all!
  Debian has a mutt-utf8 package that's compiled against ncursesw.

 Not quite - it's some kind of additional packages - maybe it includes just
 the updated binary, I don't really know or care - it works!

  Last time I checked, mutt compiled against the ordinary ncurses
(as opposed to ncursesw) does NOT work for characters with East
Asian width of 'full'. You may get an impression that it works
because you use it only for chars. with East Asian width of 'half'.
For CJK, compiling mutt against 'ncursesw' is a must.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mp3-tags, zip-archives, tool to convert filenames to UTF

2003-02-17 Thread Jungshik Shin



On Fri, 14 Feb 2003, Jungshik Shin wrote:
 On Fri, 14 Feb 2003, Nikolai Prokoschenko wrote:

  On Fri, Feb 14, 2003 at 07:01:56PM +0100, Helge Hielscher wrote:
 
   1) I have some mp3-Files with ID3-Tag, most of these files use the
   ISO-8859-1 encoding, but some use a russian encoding. Which programms
   can display the russian ID3-Tags? I have tried XMMS, but with no
   success.

   If you have a mix of mp3 files with id3v1 tag in ISO-8859-1
 and other mp3 files with id3v1 tag in KOI8-R, the only way to display
 both kinds of tags correctly *simultaneously*(in a single xmms
 session) is to convert both tags to UTF-8 and run xmms under UTF-8 locale.

  One problem with this  is that most portable mp3 players in the
market can't handle UTF-8 although they support a dozen or more
languages. Consequently, you may have to reconvert id3v1 tags
in your mp3 files if you need to store them in portable
mp3 players. They support multiple languages by assuming that
there's a one-to-one correspondence between languages and
encodings. This is plainly wrong, but there's not much they
can do given that id3v1 tag does not have any means of indicating
which encoding is used and for the vast majority of mp3 files
circulated and made on the net the aforementioned one-to-one mapping
is valid.

 BTW, id3v2 tags don't have this problem.

  We can just hope that id3v2 will be widely used soon and
new generations of portable mp3 players will support it.

  BTW, a number of PDAs, mobile phones and other devices
might share the problem arising from the misguided assumption that
languages/scripts and encodings are tightly bound to each other (the
same is true of stupid web mail services like Hotmail, Yahoo mail,
etc). Hopefully, wider use of Linux in those devices and better
UTF-8 support in Linux will change the situation.


  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: dos2unix and UTF-8 BOM

2003-02-17 Thread Jungshik Shin


On Sun, 16 Feb 2003, Roozbeh Pournader wrote:

 I was thinking about the annoying BOM-like sequence that Windows 2000's
 and XP's Notepads are putting at the beginning of UTF-8 files. The byte
 sequence EF BB BF that's invalid as a header/signature in Unix UTF-8.

 Shouldn't 'dos2unix' be patched to also remove this sequence?

  That would be useful. However, that doesn't work very well if multiple
files are fed to it (e.g. 'cat a b c | dos2unix'). And, that's why
we all hate UTF-8 BOM ;-).
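
  To illustrate the point: a BOM-stripping filter along the lines of the
sketch below (my own sketch, not actual dos2unix code) only knows about the
very beginning of its input, so a BOM buried in the middle of a concatenated
stream survives untouched:

  # sketch: remove a UTF-8 signature (EF BB BF) only at the start of the stream
  binmode STDIN; binmode STDOUT;
  my $first = 1;
  while (read STDIN, my $buf, 65536) {
      $buf =~ s/\A\xEF\xBB\xBF// if $first;
      $first = 0;
      print $buf;
  }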

  How about these?

 Incidentally, it just occurred to me that ftp/ssh clients may offer a
user-configurable option for the automatic removal of the 'UTF-8 BOM' at
the beginning of a text file in UTF-8 when moving files from Windows to
non-Windows platforms (Unix/Unix-like OS and MacOS). The same is true
of Kermit (Frank, are you here?). All those tools can be configured
to translate between three (and nowadays even more?) EOL conventions,
CR/LF/CR+LF, for text files. Then, the automatic removal (and addition if
that's regarded as necessary) of UTF-8 BOM at platform boundaries
would be as useful.

   As for web servers, a configurable option can be added to remove the
UTF-8 BOM at the beginning of the text/* files they serve. For instance,
it's easy to write a simple module for Apache (used at the Unicode.org web
site) to do that.

   VFAT, NTFS and  FAT for Linux can be modified in a similar way.
And, editors like Vim (which automatically detects EOL used in
text files) can do the same.

   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: redhat 8.0 - using locales

2003-01-11 Thread Jungshik Shin


On Fri, 10 Jan 2003, Markus Kuhn wrote:

 strongly preferred that locale names do not use a country name at all,
 unless it is necessary to distinguish between countries. The only excuse
 to do so is usually the currency field, which nobody uses anyway and

  LC_COLLATE is sometimes region/country dependent. For instance,
ko_KP and ko_KR have different collation rules (although I wish
there were a common set of rules shared by ko_KR and ko_KP).
In addition, differences between zh_* in LC_MESSAGES are not
trivial.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: hanzi vs kanji

2003-01-04 Thread Jungshik Shin


On Fri, 3 Jan 2003, Maiorana, Jason wrote:

 Can we please maintain the distinctions between
 1. language,
 2. script, and
 3. typeface 'category' or other typeface differences.

 That's really the question: Is the difference between
 Hanzi and Kanji more one of typeface or of script.

 I would argue that it is a real script difference,

   I strongly disagree with you on this point.
Most people on the Unicode list would agree with
me. If they're different scripts, CJK Unification
should be overthrown right away.


 but it is typically implemented as a typeface
 difference. A character in these scripts do have
 a precise set of radicals, stroke order, and
 proportion.

   This is only the case if you regard anything
other than what the Japanese MoES (Min. of Education and Science) standardized
as 'non-Japanese'.  My grandfather, father and I (Koreans) could write
a single Chinese character with different stroke counts and sometimes
even differently looking radicals, but all of us know what we mean.


 (Stylization is something applied
 afterwards, deviating from the script norm.)

   Who has the final say in the script norm?
I don't want Korean MoE(Min. of Education)
to tell me to change the way I write
some Chinese characters. My grandfather would
get enraged if some ignorant bureaucrats
in Seoul wanted him to change the way
he writes.



 It is certainly possible for some to overcome this
 difference, and read their own language despite
 its being in another script, but that does not
 prove that they are identical scripts.

   Neither does it prove that they're different
scripts.


 The difference between fraktur and arial however,
 is purely one of typeface, and seems relatively
 trivial.

   If it's trivial, the diff. across CJK glyph
variants is far far far  more trivial.

   Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Japanese Input under RH8

2002-12-13 Thread Jungshik Shin


On Fri, 13 Dec 2002, Mike FABIAN wrote:
 Jim Z [EMAIL PROTECTED] wrote:

  I tried your tip to bring up kinput2
 I.e. you tried

  export XMODIFIERS=@im=kinput2
  LANG=ja_JP LC_ALL=ja_JP kinput2 -xim -kinput -canna 
  LANG=en_US.UTF-8 LC_CTYPE=ja_JP.UTF-8 program...

  I thought you had written that the following also works with
a new kinput2 (assuming LC_CTYPE/LC_ALL is not defined) and that
might have been what Jim tried.

   export XMODIFIERS=@im=kinput2
   LANG=ja_JP.UTF-8 kinput2 -xim -kinput -canna
   LANG=ja_JP.UTF-8 program-where...

Actually, I've just tried it and it worked.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: [Fonts]Re: Xprint

2002-12-11 Thread Jungshik Shin
On 11 Dec 2002, Juliusz Chroboczek wrote:

 Sorry for mis-reading your mail, then.

  No problem :-)

 JS   As for complex script rendering, it's possible...

 You'll doubtless agree with me that what you're describing are a
...
 for decades now -- it's high time to move on.

  Yes, I agree with you, but somebody needs to do the work.
Actually, the most difficult part may  not be programming but may be
getting/making some intelligent fonts (opentype or AAT) for complex
scripts. For Indic scripts, things are going pretty well and the number
of freely available opentype fonts for Indic scripts is increasing. For
Korean, it's not so good, as I wrote before. I have yet to see a single
free opentype font.

  BTW, you'll be surprised to read comments made by some people at
http://bugzilla.mozilla.org/show_bug.cgi?id=144663. They want
to kill PS module in mozilla in favor of Xprint.


 JC I'm a little bit suspicious about their choice to use Type 42 CIDFonts

 JS Given that truetype fonts are much easier to come by than genuine
 JS CID-keyed fonts for CJK (which is also true of truetype fonts vs PS
 JS type 1 fonts for European scripts although to a lesser degree), I guess
 JS the choice is all but inevitable...

 I may have misunderstood something, but last time I checked the
 approach was to use Type 42 CIDFonts *only*.  These are currently a
 fairly rare beast (only supported since version 3012, if memory serves).

 I also thought that's the case. However, Brian Stell changed the plan
(see http://bugzilla.mozilla.org/show_bug.cgi?id=144663. ) and he's now
gonna use type 8 (neither type 11=what you're calling type42 CIDFont =
CIDFont type2 nor type 42). What's type 8 font, btw?


 JC [using Type 42 CIDFonts] will require many users to rasterise
 JC everything with ghostscript on the host, with all the ensuing
 JC performance and printing quality issues.

Because you wrote the above, I thought that you had reservations about
doing everything on the host side, regarding printers as dumb devices, which
may sacrifice the printing quality. I also thought that you preferred to
leave as much as possible for PS printers to take care of. That's why I
didn't even mention the most certain way to produce portable PS output
(type3 bitmap) and I wrote about the percentage of end-users owning
PS printers.

 Conversion to Type 1 fonts works everywhere, gives excellent results,
 and the code is readily available (ttf2pt1).  Finally, if everything

   Does this conversion code also work for large CJK ttf fonts (with more
than 256 glyphs)? Or, does it also support conversion to a composite
font (OCF)?

 As you see, I am not arguing against support for CIDFonts; I'm merely
 stating that making Type 42 CIDFonts the only download format for TTFs
 makes me er... suspicious.

  I'm not against producing portable PS, either :-).  However,
I think the portability of PS output doesn't matter much considering
the way printing is handled these days in Unix/Linux.

   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: cxterm cut/paste: COMPOUND_TEXT, UTF8_STRING?

2002-12-09 Thread Jungshik Shin
On Mon, 9 Dec 2002, Tony Laszlo wrote:

Hi,

 I found this 1999 post in the mozilla-i18n archives from Jungshik.
 http://www.geocrawler.com/archives/3/113/1999/7/150/2441628/

 I seem to be having a similar issue, at the moment, with Chinese
 copied from cxterm and pasted into Mozilla (or yudit, or an mlterm
  window). RH7.1, latest Mozilla, latest yudit, kde.

  As I wrote there, cxterm and hanterm are to blame because
they violate the X11 ICCCM.  Mozilla, yudit, mlterm and kde are doing just
what they're supposed to do. (I mentioned a work-around that may be
implemented by 'programs on the receiving end' in my posting, but I
think that's not a good idea.) Mozilla has since implemented UTF8_STRING.
'The' way to solve this problem is to fix cxterm and hanterm to support
UTF8_STRING and COMPOUND_TEXT. kterm (Kanji term) and rxvt (cjk) support
COMPOUND_TEXT, and mlterm and xterm (XFree86) support both.

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Input under RH8

2002-12-07 Thread Jungshik Shin



On Fri, 6 Dec 2002, Maiorana, Jason wrote:

 First, thanks to Jungshik Shin  Mike FABIAN for your
 replies.

 You're welcome :-)

 I surmise that the current state of RH8 is that it is not
 yet suitable for entry of all languages simultaneously.
 (flaws in XIM itself being part of the problem)

 You're right. You can't do MS Windows/MacOS style IME
switching, yet, in all applications.


 I can probably setup some scripts to pop up a gedit in a
 given mode, but, with the exception of VIQR and Korean,
 I cannot yet graphically switch around to any input method
 with the version of gtk2 that comes with rh8.

   Gtk2 as shipped in RH8 has Thai (broken?), Tamil,
Cyrillic (transliterated), Inuktitut, IPA, Tigrigna-Ethiopian,
Tigrigna-Eritrean, and Amharic input modules in addition to XIM,
a Vietnamese, and a *broken* Korean (KSC5601) input module. For Korean, you'd
better install 'imhangul' input module at http://imhangul.kldp.net. You
can download the source by clicking 'download' in red and install it by
following the instruction in the gray box below the link for download.
If this is the first time you install 'imhangul', you have to run 'make
install' twice (it's due to a bug to be fixed.)

  You can also make use of Xkb. With its support of multiple
levels, you can add yet another 'input method' to your repertoire of input
methods accessible in gedit(a gtk2 application). As for Xkb, refer to
XFree86 I18N archive.

 Hopefully, in the near future, RH will ship all utf-8
 locales by default, and gtk2 will have a XIM wrapper
 that allows access to any input method on the system
 from any language locale.

  Alternatively, 'meta XIM server' (as implemented at the client level
by Yudit and mlterm) that lets users switch between multiple XIMs will
be handy. Then, it can be used for non-gtk2 applications as well as
gtk2 applications.

 BTW, has anybody heard of gtk2 input modules for Chinese and Japanese?
A quick googling didn't turn up anything.

   Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: UTF-8 wakeup call

2002-12-07 Thread Jungshik Shin
On Sat, 7 Dec 2002, Kent Karlsson wrote:

  The mappings used are at least also from the RFC 1345 (recode uses that)
  or the IS 15897 which uses many of the same names and mappings.
  Specifically I have seen that Linux is *not* using the Unicode data
  because of copyright issues.

 Hmmm.  From http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html:

   Limitations on Rights to Redistribute This Data

   Recipient is granted the right to make copies in any
   form for internal distribution and to freely use the

 I don't see this as restrictive for use in Linux.  I'm sure Unicode
 consortium would like to see its data being used also in open source

   glibc 2.x may not use them, yet. However, glib (and other libraries
built on top of it) indeed makes extensive use of the Unicode data files.
So do Perl, Yudit, Mozilla and other free/opensource programs/projects
that run on Linux.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: Japanese Input under RH8

2002-12-06 Thread Jungshik Shin

On Fri, 6 Dec 2002, Maiorana, Jason wrote:

 thanks for the tips, but what I really wanted was to use japanese/other
 language input methods, but not be in a ja_JP locale (just the default
 locale, en_US.UTF-8). (Also I was hoping it could be done in an application
 that was already running, for example I would start off in VIQR, then maybe
 do some korean input, then switch to XIM/kinput2/canna, all in the original
 gedit window...)

  You're talking about two different things here. One is XIM
and the other is gtk2 input modules. Gtk2 input module mechanism (that
you bring up by 'right-clicking' in gtk2 input widget area) lets you do
what you want. It also supports XIM as one of the supported 'modules'. Under
the en_US.UTF-8 locale, the XIM selected is (unless XMODIFIERS is set to
@im..)  the default built-in XIM, which is the Compose mechanism. The Compose
mechanism is pretty powerful for alphabetic scripts although it's not
so useful for Japanese and Chinese.


 im curious why I would set the LC_CTYPE to ja_JP.UTF-8,
 why would that be any different than en_US.UTF-8 when the
 LANG is en_US.UTF-8. I'm not worried about japanese collation
 i'd prefer to use a default unicode collation.

  Unfortunately, most XIM servers are written in such a way
that they can only be launched under a certain locale.  However,
the gtk2 input module mechanism can be used to achieve what you want
(switching between any number of different input modules in any UTF-8
locale). Somebody has to write (a) gtk2 input module(s) for Japanese
(if it hasn't been written yet. There is a very powerful set of Korean
input modules for gtk2, all based on U+1100 Hangul Jamos alone). Then, you
can use it regardless of the locale you're in. This is great as long as
you use gtk2 applications. For non-gtk2 applications, it doesn't work,
though, and there's still a need to write a 'wrapper XIM' server that
lets users invoke multiple XIM servers at will. There are a couple of
projects going on in that direction. There's also a 'next generation input
protocol' for X11 and other platforms (look around http://www.li18nux.org).
You can find more details in the XFree86 I18N mailing list archive.


 Im curious, why do you suggest that kinput2 should be run with
 eucJP as its startup encoding? Does it have bugs if that is not the
 case?

  I guess kinput2 was written that way. That was also the case with the
Korean input method Ami without my patch. Because, when launched under
ko_KR.EUC-KR, it can't be used to input the full repertoire of Hangul
syllables in Unicode, I patched it to be launchable under the
ko_KR.UTF-8 locale.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Why doesn't Linux display Japanese file names encoded in UTF-8?

2002-12-06 Thread Jungshik Shin



On Fri, 6 Dec 2002, Jim Z wrote:

Jim,

 However, there are issues. After those changes
 when I logged into Japanese EUC locale, everything
 is displayed in English. :( So was for Japanese
 UTF-8 locale. Is that because the system couldn't
 find the resources?

 Have you checked what's in /etc/sysconfig/i18n and ~/.i18n?
Why don't you make both of them clean and see what you get?
Also make sure that you installed the kde-i18n-Japanese package
for KDE.  In my case, both Gnome and KDE came up nicely in
Japanese.

 I didn't check and made sure
 that the locale.dir was modified (I'll check again).
 Also, in UTF-8 for Japanese mode, there is no
 Japanese input (Shift-space bar).

 As already noted by others, kinput2 has to be launched under
ja_JP.EUC-JP. Certainly, this has to be fixed.

 In general, looks like UTF-8 works on Linux for CJK;

 There are still some issues (input methods as you found,
localized man pages).  Localized man pages are mostly in legacy encodings
and it's hard to figure out how to make them work in a UTF-8 locale (if
at all possible). 'man', 'less' and 'groff' all do things differently
(when it comes to interpreting LC_* and LANG environment variables) and
they interact with each other in an intricate way. At least, I think 'man'
has to be fixed to either call setlocale(LC_MESSAGES,...) directly or
to use the SUS-provisioned order of resolving LC_*/LANG env. variables
(i.e. 1. LC_ALL, 2. the individual LC_* categories, 3. LANG).  At the
moment, even 'LC_ALL=C man xyz' doesn't give me man pages in English, let
alone 'LC_MESSAGES=C' when LANG is set to ko_KR.UTF-8.  Note that LANG
should be given the lowest precedence in the locale resolution and LC_ALL
should be at the top. Certainly, man doesn't honor that order.
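
For what it's worth, the resolution order I mean can be sketched like this
(just an illustration of the precedence; the function name is made up and
this is not how man/groff are actually implemented):

  # sketch: SUS-style resolution of the locale for a single category,
  # here LC_MESSAGES; LC_ALL wins, then the category, then LANG, then "C"
  sub effective_messages_locale {
      for my $var (qw(LC_ALL LC_MESSAGES LANG)) {
          return $ENV{$var} if defined $ENV{$var} && $ENV{$var} ne '';
      }
      return 'C';
  }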

  A couple of years ago, we discussed how to tag (if we decide
to tag them at all) the encoding used in man pages, but it got nowhere. A
reasonable approach appears to be to convert them all to UTF-8 (assuming
groff UTF-8 support will come along soon).


 however, there is no way for general users to do what
 they intent to do.

  According to what I heard on this list, SuSe 9.1
offers UTF-8 locales for all languages as an alternative to traditional
encodings so that SuSe users should have no problem there.
Mandrake 9.0 seems to do it, but it doesn't work out of the box
(I have to make some modifications) as far as I can tell.

 Your help is appreciated and I would like to see your
 fixes get into near future builds so all can benefit.

  My changes to XFree86 have gotten into the XFree86 CVS, so
I guess they'll be included in the upcoming 4.3.0 release. With increasing
use of Xft/fontconfig and client-side fonts, the importance of
my patch (to the X11 locale) will diminish.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: Japanese Input under RH8

2002-12-06 Thread Jungshik Shin
On Fri, 6 Dec 2002, Jungshik Shin wrote:
 On Fri, 6 Dec 2002, Maiorana, Jason wrote:
  im curious why I would set the LC_CTYPE to ja_JP.UTF-8,
  why would that be any different than en_US.UTF-8 when the
  LANG is en_US.UTF-8. I'm not worried about japanese collation

   Unfortunately, most XIM servers are written in such a way
 that they can only be launched under a certain locale.  However,

  BTW, I didn't mean that kinput2, Xcin and Ami cannot
be modified to work under the en_US.UTF-8 locale. They can, but their
dependency on fontsets makes them work less optimally than under their
'native' locales. I guess we have to give up 'stretching' the old XIM
protocol and had better focus on the new IIIMF (Internet Intranet Input
Method Framework: http://www.openi18n.org/subgroups/im/IIIMF.
Li18Nux.org changed the name to become OpenI18N.org) or gtk2 input modules
or similar mechanisms. MS Windows has something called TSF (Text Services
Framework) which appears to be very flexible. IMHO, XIM is too old to
be on par with the likes of TSF. IIIMF is in a far better position for that
than XIM.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: filename and normalization (was gcc identifiers)

2002-12-05 Thread Jungshik Shin



On Wed, 4 Dec 2002, seer26 wrote:

  is to insist that  11,172 modern precomposed syllables be encoded
  in Unicode/10646. Next biggest blunder they made is to encode tens
  of totally unnecessary cluster-Jamos when only 17+11+17+ a few more
  would have been more than sufficient. Next stupid thing they did is


 Would Chinese be in a similar situation if the radicals were
 combining characters, and any combination of them could in theory be
 a valid character?

  Possibly. However, radicals are only a small subset of 'components'
used in Chinese characters. You need to have a lot more 'components'
than radicals listed in any Chinese character dictionary.

 In practice, of course, a normal person would use
 far fewer than 10,000 distinct characters.

  Do you think anybody wants a character set standard (like
Unicode) to specify the list of sequences of Latin/Greek/Cyrillic
letters that are allowed? Imagine that you can use 'ab, eb, ob, se,
ce' but cannot use 'sce, gh, ph'. That's what encoding a fixed set of
precomposed syllables does to the Korean alphabet.

 Have you ever needed a character that wasn't among the 11,172 precomposed
 ones?

  Sure! See http://jshin.net/i18n/korean/hunmin.html
or http://jshin.net/i18n/uyeo.html. 11,172 precomposed syllables don't
include any pre-1933 orthography syllables.  The set doesn't include
modern incomplete syllables(which high school Korean teachers need to
teach Korean grammar), either. Basically, it was a very stupid idea
(and a vast waste of codespace) to enumerate possible combinations of
alphabetic letters.  Just encoding alphabetic letters should be more than
enough. I wish Korean Nat'l Standard body had been half as competent as
as its counterpart in India. ISCII (which ISO 10646/Unicode copied almost
verbatim) did a great job of encoding only what's absolutely necessary for
Indic scripts. And, that was in early 1990's when no intelligent modern
rendering engine and font were in sight. They, however, had the foresight
that encoding hundreds or thousands of 'presentation forms' for each
of the Indic scripts was not the way to go and that eventually intelligent
and advanced fonts/rendering engine would come out. They were right and
nowadays Indic scripts are pretty well supported by Pango, Uniscribe,
ATSUI, and Graphite. It may take a while longer to have opentype
fonts in the public domain for all Indic scripts, but they're coming...

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: filename and normalization

2002-12-05 Thread Jungshik Shin
On Wed, 4 Dec 2002, Werner LEMBERG wrote:

   the manpage was not using a regular ascii '-', but instead one of
   the HYPHEN, or EM_DASH things (Which is why i HATE them).
 

  you can configure the way your 'man' works in man.config.  You can
  set NROFF to use '-Tascii -man' and you get 'ASCII approximation' of
  real em_dash, hyphen etc so that you can copy and paste and search

 A better temporary solution is to add the following to man.local:

   .if '\*[.T]'utf8' \
   .  char \- \N'45'

  Thanks. It worked great. Neither Mandrake 9 nor RH 8 has this
in man.local. I guess they should.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: gcc identifiers

2002-12-04 Thread Jungshik Shin

On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote:

 On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote:

  Maybe a --normalize-utf option to the linker might be a good idea, but
  it should be an option, IMO.

 First of all, the standard does not refer to Unicode, but to 10646.
 And the C standard does not use Unicode normalization.
 There is a list in the ISO C standard of 10646 characters that are
 allowed in identifiers, and these do not have alternate representations.

  Thank you for the note.

  I found FCD of ISO/IEC 9899 1999 (N2794 at
http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug.,
1998.  In Annex I 'Universal Character names for identifiers' (page
487; page 499 if you use Acroread to view the PDF version), the set of
allowed characters is listed. (A more or less identical list is found at
http://std.dkuug.dk/TC1/SC22/WG20/docs/standards#10176) Basically ISO C99
seems to avoid problems arising from multiple representation issues by
allowing only precomposed characters in identifiers(is there any change in
this regard in the finally approved ISO/IEC 9899 1999?) Keld's statement
that they do not have alternate representations is not right.
If that were the case, characters like 'Latin Small Letter with Macron'
or 'Hangul Syllable Gga' for which there are alternate representations
should not be present in the list, but they are listed as allowed.

  What ISO C99 seems to do is to shift the burden of normalization from
compilers and linkers to editors or whatever tools programmers use to edit
source files.  That's fine (editors can do that) and is perhaps
a wise decision (preventing potential troubles from propagating thru
a compiler-linker chain at the earliest stage by issuing an error and
stopping compilation), but there's a little trouble with allowing only
precomposed characters. Both ISO/IEC JTC1/SC2/WG2 and UTC would not encode
any more precomposed characters which can be represented with existing
base characters followed by one or more combining characters. However,
'combining diacritical marks'(e.g. \u0300 - \u0362) are not allowed in
identifiers  so that 'any character' that's not encoded as a precomposed
form can't be used in identifiers. Some people would resent not being able
to use 'their characters' in identifiers and may use it to make a case for
encoding precomposed forms of theirs in ISO 10646.  How about references
to filenames (as in '#include directive') with combining diacritic
marks that are parts of characters NOT encoded in precomposed form?
Aha, they can use '\u' (or '\U') escapes...

  Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin



On 4 Dec 2002, H. Peter Anvin wrote:

 By author:Jungshik Shin [EMAIL PROTECTED]

All right. That's what the *current* SUS/POSIX says. However, that
  is hardly a solace to a user who'd be puzzled that two visually
  identical and canonically equivalent filenames are treated differently.

 There *is* no way to solve this problem.  You have the same kind of
 problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL
 LETTER ALPHA.  However, if you attempt normalizations you *will*

  U+0041, U+0391, and U+0410 are NOT equivalent in any Unicode normalization
form. They're not even equivalent in NFK*.  Note that I didn't
just say visually (almost) identical but also qualified it
with 'canonically equivalent'.

 introduce security holes in the system (as have been amply shown by
 Windows, even though *its* normalizations are even much simpler.)

  Therefore, your example cannot be used to show that there's a security
hole (unless you're talking about applying a normalization not specified
in Unicode) although it can be used to demonstrate that even after
normalization, there still could be user confusion because there are some
visually (almost) identical characters that would be treated differently.

  A better example for your case would be U+00C5 (Latin capital
letter A with ring above) and U+212B (Angstrom sign), or U+004B and
U+212A (Kelvin sign). They're canonically equivalent.
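
A quick way to check such pairs (just a sketch using Perl's
Unicode::Normalize module, which comes with recent perl):

  use Unicode::Normalize qw(NFC NFD);
  # U+212B ANGSTROM SIGN normalizes to U+00C5 (NFC) / U+0041 U+030A (NFD),
  # so the two are canonically equivalent ...
  print NFC("\x{212B}") eq NFC("\x{00C5}") ? "equivalent\n" : "distinct\n";
  # ... while U+0041 and U+0391 stay distinct under every normalization form
  print NFD("\x{0041}") eq NFD("\x{0391}") ? "equivalent\n" : "distinct\n";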

 available to the user (ls -b or somesuch.)  Attempting
 canonicalization is doomed to failure, if nothing else when the next
 version of Unicode comes out, and you already have files that are
 encoded with a different set of normalizations.  Now your files cannot
 be accessed!  Oops!

 I might agree that normalization is not necessarily a good thing.
However, your cited reason is not so solid. Unicode normalization forms are
**permanently frozen** for existing characters. And, UTC and JTC1/SC2/WG2
committed themselves not to encode any more precomposed characters that
can be represented with existing base char. and combining characters. If
you're not sure of their commitment, perhaps using NFD is safer than
using NFC. Hmm.. that may be one of the reasons why Apple chose NFD in Mac
OS X.

  BTW, without changing anything in the Unix APIs and Unix filesystems (which
would not be desirable anyway), shells 'might' be a good place to
'add' some normalization (per a user-configurable option at the time
of invocation and with env. variables).

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: filename and normalization

2002-12-04 Thread Jungshik Shin


On Wed, 4 Dec 2002, Maiorana, Jason wrote:

 As a side-note, I copy/pasted a command line flag from a RH8.0
 manpage back into the console, and tried to execute the command.
 It failed, and gave me usage. The reason, I discovered, is that
 the manpage was not using a regular ascii '-', but instead one
 of the HYPHEN, or EM_DASH things (Which is why i HATE them).

  I discovered that a long time ago and gave up copy'n'pasting from
man pages.  I began to write that those characters should not be used in
man pages, but then I came up with a couple of arguments against my own
position and didn't send a message here. One of them was that you can
configure the way your 'man' works in man.config.  You can set NROFF to use
'-Tascii -man' and you get an 'ASCII approximation' of the real em dash,
hyphen etc so that you can copy and paste and search backward/forward for
command line options. Another was that a man page is not only for screen
viewing but also for printing out. When printed out, genuine hyphens and em
dashes certainly look better than their ASCII approximations.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin



On Wed, 4 Dec 2002, Maiorana, Jason wrote:

 Normalization for D has some serious drawbacks: if you were to try
 to implement, say vietnamese using only composing characters,
 it would look horrible. The appearance, position, shape, and size
 of the combining accents depends on which letter they are being
 combined with, as well as which other diacritics are being combined
 with that same letter.

  What's your point here? NFD or NFC, they should be rendered
identically by 'modern' rendering engines.  You're making an assumption
that the way characters are rendered depends on which NF they're
stored/represented in. At least in principle, that should not be the case.
Even a not-so-capable renderer (e.g. xterm with a bitmap font, or the
Linux console) can do an internal normalization to fit its needs
and capabilities.

 NF-C is most appropriate for some scripts, and NF-D may be desirable
 for others. It would be better,

  What are your criteria? Again, rendering? As I wrote above,
that has nothing to do with NFs used.

 IMO, if unicode would get rid
 of both forms, and simply support one representation of each
 possible glyph. (No combining characters unless they are the ONLY

  'glyphs'? A coded character set is not about glyphs but about
characters.

 way to represent a particular glyph) (Actually, no combining chars
 at all would be best, because its simplest. Why not just assign
 more code space to the langs that need it?)

 Do you want to give 1.5 million (and more) code points to the Korean script?
Why don't you propose your idea to UTC and ISO/IEC JTC1/SC2/WG2?
Either your mailbox will be bombarded with a lot of emails
or you will be greeted with 'dead silence'.

 If you have a filesystem that forces NF-D, then I would say it's a
 poorly designed filesystem that makes such choices, because it's
 way too low-level to care about things like that. Filenames should
 be strings of bytes, and the UI conventions should allow one
 to distinguish. If you are on an NF-C==canonical system, and you
 mount such a filesystem, you should see bakemoji, and not
 any translated normalization form.

  Why bakemoji? No matter what NF is used in filenames, they should
just be rendered as they should be rendered by any Unicode-compliant
rendering engine.  This behavior is more consistent with your view
that filenames are strings of bytes than showing 'bakemoji' is.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin


On Wed, 4 Dec 2002, Maiorana, Jason wrote:

 If characters are ever introduced which have no precomposed codepoint,
 then it will be difficult for a font to normalize them to one
 glyph which has the appropriate internal layout. The font file itself
 would then have to know about composition rules, such as when
 X is composed with Y then Z, then use this glyph XYZ which has no
 single codepoint in unicode.

 Have you ever heard of Opentype and  AAT fonts? Modern font
technologies and modern rendering engines (Pango, AAT, Uniscribe,
Graphite) can all do that. Otherwise, how would Indic scripts be used
at all?  What you describe above is done every day by Pango,
Uniscribe, AAT/ATSUI and Graphite.


 For that reason, I don't like form D at all.  I wonder how much space
 it would take to represent every possible Jamo combination, then just
 do away with combining characters altogether...

  No way!!  The biggest blunder ever made by the Korean nat'l standard body
is to insist that 11,172 modern precomposed syllables be encoded
in Unicode/10646. Next biggest blunder they made is to encode tens
of totally unnecessary cluster-Jamos when only 17+11+17+ a few more
would have been more than sufficient. Next stupid thing they did is
to remove the compatibility decomposition between cluster Jamos and basic
Jamo sequences although they should be canonically (not just compatibly)
equivalent.  Now, you're saying that all possible combinations of them
should be encoded. How many? It's __infinite__ in theory. In practice, it
could be around 1.5 million.  That's more than the total number of codepoints
available in the 20.1-bit coded character set which is ISO 10646/Unicode.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin


On 4 Dec 2002, H. Peter Anvin wrote:

 By author:Jungshik Shin [EMAIL PROTECTED]

  How many? It's __infinite__ in theory. In practice, it could
  be around 1.5 milllion.  That's more than the total number of codepoints
  available in 20.1 bit coded character set which is ISO 10646/Unicode.

 And people give me funny looks when I tell them not to trust the 20.1
 bits forever statement from Unicode, just as I didn't trust the
 earlier 16 bits forever statement...

 Whether you're convinced or not, it's not only in Unicode but also
inscribed in ISO 10646.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




RE: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin



On Wed, 4 Dec 2002, Maiorana, Jason wrote:

  For that reason, I dont like form D at all.  I wonder how much space
  it would take to represent every possible Jamo-combination, then just
  do away with combining characters alltogether...
   No way!!  The biggest blunder ever made by Korean nat'l standard body
 is to insist that  11,172 modern precomposed syllables be encoded
 in Unicode/10646. Next biggest blunder they made is to encode tens
..
 available in 20.1 bit coded character set which is ISO 10646/Unicode.

 Wow, ok, I guess that idea won't work for Korean.
 Also, since glyph swapping has to be done for merely adjacent characters,
 doing it for combining ones must be a relatively minor concern.

 Out of curiosity, how many of those Korean letters are actually
 made use of by the language? 1.5 million sounds higher than any
 number of phonemes that a human can produce

   Needless to say, modern Korean speakers can pronounce only
a very very small fraction and chances are that the number will decrease
as time goes by because as in most other languages, speakers are on the
winning side of the battle between listeners and speakers.  You have to
understand that Korean Hangul is alphabetic and the number of possible
syllables that can be made out of a finite set of alphabetic letters is
infinite whether it's Latin, Greek, Cyrillic, Indic or Korean.


 (what if the cluster jamo's were dropped?)

   It doesn't make any difference at all. Cluster Jamos can be
represented just as well by a sequence of basic Jamos.  Please note that
the most generic form of a Hangul sequence is given as

   L+V+T*M?

where L, V, T, and M denote leading consonant, vowel, trailing
consonant and combining mark (for Hangul, most likely one of the two
tone marks), and '+', '*', '?' have their usual meanings in REs.
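
As a rough sketch (restricted to the modern basic Jamo blocks; the exact
ranges and the choice of U+302E/U+302F as the tone marks are my reading of
the above, not a complete definition), that pattern could be written as a
Perl regex on Unicode strings:

  # L+V+T*M? for modern Jamos
  my $syllable = qr/
      [\x{1100}-\x{1112}]+    # L: leading consonants
      [\x{1161}-\x{1175}]+    # V: vowels
      [\x{11A8}-\x{11C2}]*    # T: trailing consonants
      [\x{302E}\x{302F}]?     # M: optional tone mark
  /x;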

That's why I wrote that cluster Jamos shouldn't have been encoded at all.
The same is true of all those 11,172 precomposed syllables. For Korean
Hangul, all we need are a few dozen basic Jamos. I feel 'guilty'
(although I wasn't involved in any way in forcing them through)
that Korean Hangul took about a fifth of the BMP codespace when about
a two-hundredth of that is enough.

 Are we heading for a long-run scenario, where Form-D becomes canonical,
 and all the old pre-composed codepoints are deprecated? NF-C seems
 to be getting more and more entrenched from what I can tell...

  Well, from the very beginning, UTC didn't want to have precomposed
forms in Unicode. Precomposed characters are not there because they wanted
to encode them but because they had to maintain 'compatibility' with
legacy coded character sets in which they're encoded as separate entities.
If they had been able to start afresh without any concern for
legacy character sets, there would have been NO precomposed
characters that can be represented by sequences of base characters
and combining characters.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Why doesn't Linux display Japanese file names encoded in UTF-8?

2002-12-03 Thread Jungshik Shin
On Wed, 4 Dec 2002, Jim Z wrote:

Jim,

This time, I hope my answer will solve your problem :-)

 From: Jungshik Shin [EMAIL PROTECTED]
 On Tue, 3 Dec 2002, Jim Z wrote:

You can easily add 'Japanese (UTF-8)' to your gdm/kdm language
 selection menu. See
 https://bugzilla.mozilla.org/bugzilla/show_bug.cgi?id=75829
 I couldn't get into here and is it a typo? PLEASE help - I really want to

   I'm sorry it's https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829

   I did a 'showmount -e 10.xxx.xxx.xxx' but I got scrambled Japanese
   characters for those entries that are encoded in UTF-8. Then I switched
 the
   locale to ja_JP.UTF-8, but the same stuff was returned. What's wrong
 with
   this picture?

 It's an UNIX (Linux) to UNIX (NetBSD) mount. The UTF-8 Japanese file names
 are in my NetBSD:/etc/exports. I can only mount those entries that are ASCII
 equivalent. I also tried it from Solaris 8 (logged in as 'Japanese UTF-8
 (Unicode)') and it worked fine. I am sure if I can turn on UTF8 mode I
 should be able to do so.

  NFS should be encoding-neutral just like the rest of the Unix FS
is (except for cases like exporting to and from non-Unix systems where
different file systems are used). Why don't you begin with a simpler
case? Before using UTF-8 for directory names to export via NFS, you can
begin by making sure UTF-8 filenames under an NFS-exported directory
come out all right on the client side.  BTW, I've just experimented
with UTF-8 directory names in the export list (/etc/exports); it worked fine
between Mandrake 9.0 (server) and RedHat 8.0 (client). Judging from this
and the fact that Solaris and NetBSD worked fine, it should also work
between NetBSD and RH 7.3


Needless to say, you have to run your shell in UTF-8 terminal
 (e.g. xterm 16x or mlterm) to view UTF-8 characters.
 
 I can't get it to work. 'xterm -u8' doesn't work. the locale never changes.
 From Solaris you can do a LANG=ja_JP.UTF-8 dtterm  and the new dtterm has

   You have to do the same for xterm as you do for
dtterm. 'LANG=ja_JP.UTF-8 xterm'. '-u8' option is not necessary for recent
xterm. Or, you can do in the opposite order. That is, run 'xterm -u8'
and then set LANG to ja_JP.UTF-8 in xterm (UTF-8). Actually, you have to
do the latter way if your /etc/sysconfig/i18n or ~/.i18n sets $LANG to
a value other than ja_JP.UTF-8 because the shell initialization script in
RedHat *overrides* the value set before the shell invocation with the value
in /etc/sysconfig/i18n or ~/.i18n.(see /etc/profile.d/lang.(sh|csh)).

 what is mlterm? Couldn't find it on Linux 7.3.

  I'm not sure if it's in RH 7.3. You can get it at
http://mlterm.sourceforge.net

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Why doesn't Linux display Japanese file names encoded in UTF-8?

2002-12-02 Thread Jungshik Shin



On Tue, 3 Dec 2002, Jim Z wrote:

 I created a few Japanese file and directory names in UTF-8 in Windows. Then

  How could you make file and directory names in UTF-8 in Windows?
Windows (both NTFS and VFAT) uses UTF-16 for filenames.

 I logged in from Linux (7.3) that is configured to run Japanese. From the
 login 'language' I can only select 'Japanese (eucJP)' (there is no Japanese
 (UTF-8)).

  You can easily add 'Japanese (UTF-8)' to your gdm/kdm language
selection menu. See
https://bugzilla.mozilla.org/bugzilla/show_bug.cgi?id=75829
Or, you can just set it in ~/.i18n.

 I did a 'showmount -e 10.xxx.xxx.xxx' but I got scrambled Japanese
 characters for those entries that are encoded in UTF-8. Then I switched the
 locale to ja_JP.UTF-8, but the same stuff was returned. What's wrong with
 this picture?

  How did you mount the Windows filesystem? With smbmount or NFS? If it's
NTFS that is mounted via samba, you have to specify
'iocharset=utf-8'. If it's VFAT exported over the net, you also have
to specify the codepage (for Japanese, it's 932). For local filesystems,
specifying the 'utf8' (and 'codepage=932' for VFAT) option to the mount
command would be sufficient.  (See the man pages of mount(8) and fstab.)

  Needless to say, you have to run your shell in UTF-8 terminal
(e.g. xterm 16x or mlterm) to view UTF-8 characters.

  Now, in the case of NFS, I have no idea how a 'Windows NFS server'
translates the UTF-16 used in NTFS and VFAT to multibyte encodings.  There
must be a server config. option for that (the default might be the 'ANSI'
codepage of the current locale; for Japanese, it's Windows-932/Shift_JIS).
For a Unix NFS server - Unix client, there's little need for encoding
translation although having one would be nice for some cases (e.g. EUC-JP
on the server and UTF-8 on the client side).

  Jungshik



--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: going past the bmp

2002-11-28 Thread Jungshik Shin



Thank you for the note, Owen and Bruno.

On Thu, 28 Nov 2002, Owen Taylor wrote:

 The path to adding full beyond-the-BMP support to Pango is
 pretty straightforward. (I'm a little suprised that it doesn't
 sort of work now for TrueType fonts, but I haven't tested
 it at all.)

 So, what I wrote about 'UTF-32 cleanness' was not the case. There are
some libraries that support BMP only for the moment. As for Pango,
I had the same thought as yours. I mean, for truetype fonts, I thought
it would work as it is.


On Thu, 28 Nov 2002, Bruno Haible wrote:

 Jungshik Shin writes:
  kwin(in KDE 3.x) can't handle non-BMP characters in the title
  bar of windows.

 The cause is probably that Qt's internal string representation is
 based on UCS-2.

  Aha.

 They fear to switch to UCS-4 because of the memory
 consumption.


 They don't have to, as Win32 and Java showed. If they're
worried about the memory consumption, they can just use UTF-16 instead
of UCS-4/UTF-32.  Win32 and Java showed that it's relatively easy (at
least much less complicated than supporting traditional variable-length
encodings) to modify APIs to support UTF-16 (UCS-2 + surrogate pairs to
represent non-BMP characters) instead of UCS-2.
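
The arithmetic involved is small. A sketch of how a non-BMP code point maps
onto a surrogate pair (the sample code point is just an example):

  # e.g. U+10400, a Deseret letter outside the BMP
  my $cp = 0x10400;
  my $hi = 0xD800 + (($cp - 0x10000) >> 10);    # high (leading) surrogate
  my $lo = 0xDC00 + (($cp - 0x10000) & 0x3FF);  # low (trailing) surrogate
  printf "U+%04X -> 0x%04X 0x%04X\n", $cp, $hi, $lo;  # U+10400 -> 0xD801 0xDC00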

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ISO9660 UTF-8

2002-10-21 Thread Jungshik Shin



On Mon, 14 Oct 2002, Markus Kuhn wrote:

 Jungshik Shin wrote on 2002-10-14 06:37 UTC:
When I made a patch, I wrote to the maintainer of cdrtools,
  but his response was not so positive. At first, he asked me whether


 Try again, he's just busy. (I interacted on this with him as well)

  Ilya did (I sent the email on his behalf because his ISP
is blacklisted). In his reply, he wrote that mkisofs is currently frozen
for an imminent major release. Perhaps, in the next cycle of development,
iconv() will be considered. One of his concerns was how to detect the
availability of iconv(3) with autoconf. I pointed out that iconv.m4 for
autoconf had been written by Bruno. So, this should not be a problem.

 However, I had to tell him that there's another hurdle to overcome.
My patch hard-coded 'UTF-16LE' as the codeset name for 'UTF-16 Little
Endian', but it's not very portable. There should be a way to detect
the codeset name to use with iconv(3) on a given platform for UTF-16LE.
Is there any autoconf macro written for this?

  One way I can think of is to first detect the codeset name for UTF-8
(utf-8, utf8, utf_8 and uppercase variants) by calling iconv_open with two
identical codesets, and then try iconv_open with a set of candidate
names for UTF-16LE and the detected UTF-8 name. Then, invoke iconv()
with a known UTF-8 string and check the result for endianness.
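
Sketched below in Perl, just to show the logic of the endianness check
(mkisofs itself would do this in C with iconv_open(3)/iconv(3); the
candidate names are guesses):

  use Encode ();
  # probe candidate names and keep the first one that really is little-endian
  for my $name ('UTF-16LE', 'UTF16LE', 'UCS-2LE') {
      my $enc = Encode::find_encoding($name) or next;
      my $bytes = $enc->encode('A');             # 'A' is U+0041
      if (unpack('C', $bytes) == 0x41) {         # low byte first => little-endian
          print "usable name: $name\n";
          last;
      }
  }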

  An alternative is to just make it user-configurable at  run-time.
This is easier for programmers, but not so user-friendly...



 He means ISO 13346 and its profile UDF 2.01.

[info. on 13346/UDF snipped]

 Thank you for the info. on ISO 13346 and UDF.

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Please do not use en_US.UTF-8 outside the US

2002-10-18 Thread Jungshik Shin



On Thu, 17 Oct 2002 [EMAIL PROTECTED] wrote:

   It would be yet simpler to eliminate all non-utf-8 locales.

This is what RedHat 8.0 does except for CJK, for which legacy
encodings are still used (well, for zh_CN, GB 18030 is used, which is just
another UTF in a sense). The exclusion of CJK in the switch-over to
UTF-8 is very unfortunate (I've been using ko_KR.UTF-8 for over
half a year and I really like it) and I hope it'll change soon
(see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829). As I
wrote many times before, Korean desperately needs UTF-8 and that's why
ko_KR.UTF-8 was among the very few UTF-8 locales offered for Solaris
and AIX (see Ienup's message) in the mid-1990's.


  It would be simpler, but since the vast majority of the world is still
  using legacy locales, it's irrelevant.  Come back in 5-10 years, maybe;
  I'm talking about things that can be done today.

 They could still be available, but they would not be the default
 (legacy encodings)

 When you setup a new machine, its not front-loaded with scads
 of text file docs you care about; you will add things as you go.
 If you receive new messages (email, documents, etc.) they would
 all be converted to something you can read normally. All you care
 about is that it is well integrated and it works.

  I totally agree with you.

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Please do not use en_US.UTF-8 outside the US

2002-10-17 Thread Jungshik Shin



On Thu, 17 Oct 2002, Thomas Wolff wrote:

 wolfffscce14:~ uname -a ; locale -a | grep UTF-8
 SunOS fscce14 5.8 Generic_108528-12 sun4us sparc FJSV,GPUSK
 en_US.UTF-8
 sv.UTF-8
 sv_SE.UTF-8
 sv_SE.UTF-8euro

  In principle, you could set
 
LANG=de LC_CTYPE=en_US.UTF-8
 OK, I get:

 wolfffscce14:~ LANG=de LC_CTYPE=en_US.UTF-8 /bin/sh
 couldn't set locale correctly
 couldn't set locale correctly

  That's probably because you don't have 'de' locale installed.
Have you tried 'LANG=sv_SE.UTF-8' if Swedish is all right with you?
If that's the case, you don't have to set LC_CTYPE to en_US.UTF-8.
Or, you can unset LANG and set other LC_* as you wish.

LC_CTYPE=en_US.UTF-8 or sv_SE.UTF-8  (character classification,
  collation and so forth would behave
  differently)

LC_MESSAGES=C  (if just plain English is better for you than
localized messages)

LC_TIME=C  (again, just want plain old Unix/Posix behavior)

.

 I want an LC_* setting that tells my applications to use UTF-8 and
 doesn't affect the system inappropriately otherwise, and that works
 with SunOS and doesn't let /bin/sh choke!

  I don't know why Sun doesn't ship its Solaris with all the locales
supported by Solaris. Perhaps a marketing ploy :-) DEC (now Compaq, and
should it be HP by now?)  Digital Unix 4.x (now Tru64) came with all the
locales on the OS CD-ROM. It's up to the system administrator which locales
are installed.

  Jungshik Shin


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ISO9660 UTF-8

2002-10-14 Thread Jungshik Shin




On Sun, 13 Oct 2002, Ilya Konstantinov wrote:

Dear Ilya,

Thanks a lot for 'fixing my patch' :-)

 I'm attaching a patch which complements Jungshik's original patch
 ( http://mail.nl.linux.org/linux-utf8/2002-03/msg00022.html )
 which made mkisofs use iconv instead of internal Unicode conversion
 tables.

 Jungshik's patch already worked well for 8-bit encodings, but it didn't
 account for UTF-8, which is a varying character length encoding. The
 attached patch modified joliet_strlen so that it'll return the
 correct target UCS-2 length.

 Without this patch, UTF-8 filenames containing non-Latin characters
 won't work on Windows. They would show in directory listings and be
 accessible by 8.3 names, but not by their long filenames. This patch
 remedies this problem.
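
For illustration, the counting that joliet_strlen has to do boils down
to counting UTF-8 lead bytes rather than octets. A minimal sketch (not
taken from either patch; the helper name is made up, and it covers only
the BMP, which is all that Joliet's UCS-2 names can hold anyway):

#include <stddef.h>

/* Number of UCS-2 code units a UTF-8 string will occupy after
 * conversion: count every byte that is not a continuation byte
 * (10xxxxxx).  Characters outside the BMP would need surrogate
 * pairs and cannot be represented in Joliet's UCS-2 anyway. */
static size_t ucs2_strlen(const unsigned char *s)
{
        size_t len = 0;
        for (; *s != '\0'; s++)
                if ((*s & 0xC0) != 0x80)        /* a lead byte or ASCII */
                        len++;
        return len;
}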

 Aha, that's the cause. With my patch, I was able to burn a
CD with Korean filenames (in CP949/EUC-KR, which is also a multibyte
encoding like UTF-8) that Linux has no problem accessing (I
mounted it as a Joliet CD-ROM instead of ISO 9660). However, under MS
Windows, it has the very problem you mentioned, which your patch solves.

 How do we go about merging this into the cdrtools package?

  When I made my patch, I wrote to the maintainer of cdrtools,
but his response was not very positive. At first, he asked me whether
iconv(3) is available on any platform other than Solaris. After I replied
that iconv(3) is a standard API specified in the Single Unix Specification,
that glibc 2.2.x has had it for a few years, and that Bruno's implementation
of iconv(3) in libiconv is widely available and has been ported to
virtually all platforms, he didn't reply.  He wanted instead to move
on to a more generic format (for DVD and similar media) whose name
currently escapes me. Anyway, I guess it's not a bad idea to give it
another try and make a case for your patch to him. Why don't you write to
him with a detailed explanation of what your patch does and of
the wide availability of iconv(3) on a multitude of platforms? The address
should be available in the cdrtools documentation and web page.

  Although it's desirable to fix things as far upstream as possible,
we may try to go around a bit and persuade various Linux distribution
builders to apply our patches to the cdrtools shipped in their
distros. Engineers from RH, Mandrake, PLD, SuSE and perhaps other
distros are here on the Linux-UTF8 list. Could you pick up our patches
and apply them to cdrtools?

  Jungshik


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: [Devel] Re: Linux Console in UTF-8 - current state

2002-10-10 Thread Jungshik Shin




On Wed, 9 Oct 2002, Antoine Leca wrote:

 En Vadim Plessky va escriure:
 
  |And presumably FreeType2 will have, or acquire, the smarts for
  |rendering the Arabic and Indic scripts properly.
 
  I am wondering *how important* those Arabic and Indic scripts are.
  While there is a certain number of people living in those countries, I doubt
  that they have a lot of computers, and the number of *Linux* users among
  them is questionable, too.

 To add another complexity, there is no current agreement about the way to
 encode Indic fonts. Besides proprietary glyph-based encodings (which clearly
 do not scale up), the Apple scheme looks like a dead end, so the only

 You may or may not be right about AAT and ATSUI
(http://developer.apple.com/intl).   As long as Mac is alive,
they'll live on. BTW, there's a third contender, Graphite
developed at SIL.

 solution I see is the OpenType scheme, which fits more or less with
 Unicode (but lags about 6-8 years behind), and is initiated (and as I see

   It seems like support for Indic scripts in OTF has been emerging rapidly,
and MS Windows 2k/XP has pretty good support for a few Indic scripts
using OTFs and Uniscribe. More and more Indic scripts will be supported
as time goes by. I heard that there are lots of talented programmers/font
developers on MS's typography list(?) interested in OT fonts for Indic
scripts.  Besides, I don't think Pango is much behind Uniscribe in supporting
Indic scripts with OTF.

 things, still currently owned) by Microsoft, something that is not really
 welcome in the Linux community ;-).

  I'm not sure what you mean by 'owned'. The OpenType standard
has been developed jointly by MS and Adobe
(http://www.microsoft.com/typography). I don't think Linux
developers/users are so stubborn as to reject anything invented by MS. Pango
developers certainly are not, because they've been working to support
Indic scripts with OTF (as you know well, Pango 1.1.1 now supports
Indic scripts with code ported from ICU). Neither are the developers of
XFree86 and the Freetype library, nor the author of Yudit.


 As a result, I do not believe that efforts for the Indic scripts are
 likely
 to be successful for the very next years: this is probably more of a
 long-term project; consequently, I believe that Indians will continue to
 use English when speaking with computers for a few years...

  As far as the Linux console is concerned, I agree with you. However,
on the GUI front, I'm not as pessimistic as you are, because we already
have some tangible results. IMHO, Linux can't afford to lose hundreds
of millions of potential users in South Asia when competing OSes like MS
Windows 2k/XP and MacOS X are moving forward on that front.

   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux Console in UTF-8 - current state

2002-09-11 Thread Jungshik Shin


On 11 Sep 2002, H. Peter Anvin wrote:

  On 10 Sep 2002, H. Peter Anvin wrote:
   The only sane way to deal with this is to do a console daemon in
   userspace...

 Reinventing Xterm is more like it.  One of the ideas that has come up
 is to write such a console daemon so that it could also run in an X
 window, which would give us something we right now sorely lack -- a
 consistent terminal in a window and on the console.

  Did you mean 'iterm' briefly mentioned by Redovan in this thread?
On xfree86-i18n list,  Hideki Hiura gave the details at

http://www.xfree86.org/pipermail/i18n/2002-August/003405.html

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: a basic question

2002-09-07 Thread Jungshik Shin


On Mon, 2 Sep 2002, Markus Kuhn wrote:

 M.P.N. Peters wrote on 2002-09-02 12:10 UTC:
  Recently I found out about Unicode and UTF-8. Unfortunately, it
  raises a lot of questions. My first question is, how can I, with a limited
  (= qwerty) keyboard that can generate only about 100 scancodes (I
  think), produce all the keycodes needed to reach for example the phon-
..

   - For more rarely required symbols (e.g., mathematical notation,
 for many people typically also phonetic alphabet), it might be
 a sufficient entry method to choose these with a mouse click from
 an on-screen menu. Xterm allows you to do this already today
 via the cut&paste mechanism. Just keep a short file that contains,
 neatly arranged, the Unicode characters that you need to enter most
 frequently in your work, and cut&paste from there. That's the
 technique I find myself using most frequently.

  One can also use 'ucm'
(http://www.pps.jussieu.fr/~jch/software/files/ucm-0.3.tar.gz)
by Juliusz for this purpose.


   - Have in the keyboard driver a key combination that initiates
 hexadecimal entry of a Unicode character, as a fallback mechanism
 for expert users

   As you know well, it's implemented by some application programs
(e.g. Yudit and Vim). Having this in the keyboard driver may be a good
idea. Some MS Windows applications using the 'rich edit' control (or
something like that) have this, where 'Alt-X' combined with 4 hex digits
produces a Unicode character. There's even an ISO standard for this
(ISO/IEC 14755). It's very generic, and the Yudit, Vim and MS Windows
methods are all compliant with the standard.
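
For illustration only (this is not how any particular driver or editor
implements it; the function name is made up and only the BMP, i.e. the
four hex digits mentioned above, is handled), the conversion behind
such a hex-entry mechanism is tiny:

#include <stdio.h>
#include <stdlib.h>

/* Encode one BMP code point as UTF-8; returns the number of octets
 * written to buf (buf needs at least 3 bytes). */
static int bmp_to_utf8(unsigned int cp, unsigned char *buf)
{
        if (cp < 0x80) {
                buf[0] = (unsigned char)cp;
                return 1;
        } else if (cp < 0x800) {
                buf[0] = (unsigned char)(0xC0 | (cp >> 6));
                buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
                return 2;
        } else {
                buf[0] = (unsigned char)(0xE0 | (cp >> 12));
                buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
                return 3;
        }
}

int main(void)
{
        unsigned char out[4];
        /* e.g. the four hex digits "ac00" entered after the hot key */
        unsigned int cp = (unsigned int)strtoul("ac00", NULL, 16);
        int n = bmp_to_utf8(cp, out);
        fwrite(out, 1, (size_t)n, stdout);      /* the UTF-8 octets for U+AC00 */
        putchar('\n');
        return 0;
}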

   Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: world of utf-8

2002-08-20 Thread Jungshik Shin




On Tue, 20 Aug 2002, Markus Kuhn wrote:

 [EMAIL PROTECTED] wrote on 2002-08-20 00:29 UTC:
  Does anyone know offhand what other barriers remain to
  sending email as raw utf-8?

 My experience with ESMTP have actually been rather good. The problem is
 less the email system itself, but more outdated auxiliary tools, such as
 programs that convert a mailing list archive into HTML and have been
 written without any appreciation for non-ASCII messages. Many of such

  Most of those tools also  have little notion of RFC 2047/RFC 2231.
(some are pretty good, but not yet perfect.)  The situation is a little
better with stupid web mail services (hotmail, yahoo mail, lycos mail and
a bunch of others geared for local users all over the world), but they're
still far from  multilingual.  Most of these services work more or less
(even with RFC 2047/RFC 2231 encoded headers and RFC 2045-encoded -
quoted-printable/base64- message bodies) in *one* legacy encoding(or
UTF-8 in a few cases) at a time/per user/per account.  However, they
break down if multiple messages in different encodings are present in
a single box. Besides, most of them  set MIME charset in http header
field to the legacy encoding for the language chosen by their users
(e.g.  Shift_JIS for Japanese in hotmail/yahoo mail, ISO-8859-1 for
West European languages, EUC-KR for Korean, Big5 for TC, GB2312 for
SC, ISO-8859-7 for Greek, KOI8-R for Russian etc) regardless of the
actual MIME charset specified in messages so that readers of messages
have to manually override the encoding of their web browsers to read
UTF-8 messages.  Therefore, when I write to my (not-so-computer-savvy)
correspondents (including my father) using those 'parochial' web mail
services in a language requiring characters beyond US-ASCII, I have to
use the preferred legacy encoding of speakers of that language.


 tools have been written in Perl, and thanks to the excellent UTF-8
 support of the new Perl 5.8, perhaps it is now time for the authors of
 these to have a look at the issue, because all the conversion and UTF-8
 handling infrastructure is now readily there in Perl.

  You're right. Perl 5.8 also has excellent support for handling
legacy encodings (the Encode module) so that those tools can be truly
multilingual by working primarily in UTF-8 (i.e. converting all incoming
messages in various legacy encodings to UTF-8 before presenting them
in HTML).

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: world of utf-8

2002-08-19 Thread Jungshik Shin



On Mon, 19 Aug 2002 [EMAIL PROTECTED] wrote:

  If you only have UTF-8 files you don't need to do anything.  If you
  communicate with other planets (and this message indicates you do :-)

 your message was sent:
 Content-Type:   text/plain; charset=US-ASCII
 which could be considered a utf-8 sub-set.
 Admittedly, sendmail's hangups with the eighth bit make
 sending clear utf-8 documents somewhat unreliable.

 What's wrong with sendmail? Is your machine one of the few remaining
machines running antique sendmail 5.x or sendmail 4.x?  Sendmail has
been 8bit-clean since 8.6.x.  Sendmail 8.7.x or higher is strictly
compliant with STD 10/RFC 821, RFC 1652 (the ESMTP extension) and RFC 2045.
If correctly configured, it sends out 8BITMIME messages only if it's certain
that the other end of the communication can receive 8BITMIME. Compliant
with RFC 1652, it asks whether or not the other side of the link can
understand 8BITMIME and sends 8BITMIME if the answer is positive.
Otherwise (i.e. the other side is either 8bit-clean but not compliant with
RFC 1652, or not 8bit-clean at all, like totally outdated sendmail 4.x/5.x),
again _compliant with_ RFC 1652, it falls back to quoted-printable or base64.

  It's stupid and/or non-standard-compliant MUAs/MTAs like Outlook
Express and qmail/smail (qmail/smail violated RFC 1652 a few years ago
when I checked; my apologies if they've changed their behavior since)
that are to blame for always sending base64 or blindly sending
8BITMIME without checking the other side's ability.


 Email
 is one embarrassing case where it may take a while for the
 infrastructure to catch up. (Putting all text email in a
 base-64 MIME attachment can be said to suck.)

  It doesn't have to be a MIME _attachment_. The C-T-E of an RFC 822 body
can be base64/QP as well as 7bit/8bit.

MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

or

MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

is as good as

MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

 Does anyone know offhand what other barriers remain to
 sending email as raw utf-8?

  Why would you care whether the C-T-E is base64/quoted-printable or 8bit?
If your MUA (mail user agent) can't cope with MIME, it's time for you to
consider switching to a *modern* MIME-compliant MUA.

  Besides, sendmail can convert qp/base64-encoded _single-part_
messages back to 8BITMIME messages before delivering them to local
mailboxes.

  Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




extended X11 input methods (was Perl 5.8....)

2002-07-23 Thread Jungshik Shin




On Tue, 23 Jul 2002, Tomohiro KUBOTA wrote:

 Extended input method is also needed.  For example, I cannot input
 both of Japanese and Korean in one xterm session, because there are
 no XIM servers which support both of Japanese and Korean while

  Well, Korean XIMs (such as Ami) do support Korean and Japanese input,
although entering the latter is rather inconvenient :-).  A better example
would have been, as you gave later, Japanese and French.  Even in this
case, in theory, a single XIM can be extended to support as many input
methods/keyboard layouts as it wants to. Obviously, we don't want to
do that, because it means the developers of every single XIM have to repeat
what others have done for other XIMs.

 xterm cannot switch XIM connection.  (mlterm can do this, but I

  Seriously, I can't agree with you more that we need an input method
framework under which users of every compliant X11 client can easily
switch among multiple input methods/keyboards (as is possible under
MS Windows and MacOS 9 or X).  I think IIIMF (Internet/Intranet Input
Method Framework) and its Xlib client IIIMXCF (IIIM X Client Framework)
is a (if not the) way to go. See http://www.li18nux.org/subgroups/im.
Until it's widely distributed (I heard it works well right now), 'ucm'
can be used for sporadic input of Unicode characters not supported by
the active XIM/keyboard.  Also, yudit (which also lets users switch input
methods/keyboards), vim, and openoffice offer their own ways to do this. As
a last resort, we always have cut'n'paste :-)


   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




a patch to pine4.44 for a better UTF-8(I18N) support

2002-07-12 Thread Jungshik Shin
-1252 -t UTF-8,
_CHARSET(ISO-8859-15)_ /usr/bin/iconv -c -f ISO8859-15 -t UTF-8,
_CHARSET(ISO-2022-JP)_ /usr/bin/iconv -c -f ISO-2022-JP  -t UTF-8,
_CHARSET(GB2312)_ /usr/bin/iconv -c -f GB2312  -t UTF-8,
_CHARSET(BIG5)_ /usr/bin/iconv -c -f BIG5  -t UTF-8,
_CHARSET(Windows-1251)_ /usr/bin/iconv -c -f WINDOWS-1251 -t UTF-8,
_CHARSET(Windows-1252)_ /usr/bin/iconv -c -f WINDOWS-1252 -t UTF-8,
_CHARSET(Windows-1253)_ /usr/bin/iconv -c -f WINDOWS-1253 -t UTF-8,
_CHARSET(ISO-8859-2)_ /usr/bin/iconv -c -f ISO8859-2 -t UTF-8,
_CHARSET(ISO-8859-3)_ /usr/bin/iconv -c -f ISO8859-3 -t UTF-8,
_CHARSET(ISO-8859-4)_ /usr/bin/iconv -c -f ISO8859-4 -t UTF-8,
_CHARSET(ISO-8859-5)_ /usr/bin/iconv -c -f ISO8859-5 -t UTF-8,
_CHARSET(ISO-8859-6)_ /usr/bin/iconv -c -f ISO8859-6 -t UTF-8,
_CHARSET(ISO-8859-7)_ /usr/bin/iconv -c -f ISO8859-7 -t UTF-8,
_CHARSET(ISO-8859-8)_ /usr/bin/iconv -c -f ISO8859-8 -t UTF-8,
_CHARSET(ISO-8859-9)_ /usr/bin/iconv -c -f ISO8859-9 -t UTF-8,
_CHARSET(ISO-8859-10)_ /usr/bin/iconv -c -f ISO8859-10 -t UTF-8,
_CHARSET(ISO-8859-11)_ /usr/bin/iconv -c -f ISO8859-11 -t UTF-8,
_CHARSET(ISO-8859-13)_ /usr/bin/iconv -c -f ISO8859-13 -t UTF-8,
_CHARSET(ISO-8859-14)_ /usr/bin/iconv -c -f ISO8859-14 -t UTF-8,
_CHARSET(ISO-8859-16)_ /usr/bin/iconv -c -f ISO8859-16 -t UTF-8,
_CHARSET(KOI8-R)_ /usr/bin/iconv -c -f KOI8-R -t UTF-8,
_CHARSET(KOI8-U)_ /usr/bin/iconv -c -f KOI8-U -t UTF-8,
_CHARSET(Windows-874)_ /usr/bin/iconv -c -f CP874 -t UTF-8,
_CHARSET(UTF-7)_ /usr/bin/iconv -c -f UTF-7 -t UTF-8


  There are a couple of problems with my patch.

One of them is that I haven't done anything to fix the 'one octet =
one column width' model.  In UTF-8, this false assumption completely
breaks down except for characters in US-ASCII (U+0020 - U+007E), as you
are well aware. Therefore, in the message display screen lines are
wrapped prematurely, and in the message index screen headers (subject,
recipient, etc.) are truncated prematurely.
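
For illustration, a minimal sketch of the kind of computation that would
fix it (this is not Pine code; it assumes a UTF-8 locale is in effect and
that the C library provides wcswidth, as glibc does):

#define _XOPEN_SOURCE 600               /* for wcswidth() on glibc */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Terminal columns occupied by a multibyte (e.g. UTF-8) string,
 * instead of assuming one octet == one column.
 * Returns -1 on an invalid sequence or a non-printable character. */
static int display_width(const char *mbs)
{
        size_t n = mbstowcs(NULL, mbs, 0);      /* length in wide characters */
        if (n == (size_t)-1)
                return -1;
        wchar_t *wcs = malloc((n + 1) * sizeof *wcs);
        if (wcs == NULL)
                return -1;
        mbstowcs(wcs, mbs, n + 1);
        int width = wcswidth(wcs, n);           /* columns, not octets */
        free(wcs);
        return width;
}

int main(void)
{
        setlocale(LC_CTYPE, "");                /* e.g. ko_KR.UTF-8 */
        /* "한글 test": two doublewidth Hangul syllables + space + 4 ASCII
         * = 9 columns, although the string is 11 octets long in UTF-8
         * (this source file must be saved as UTF-8). */
        printf("%d\n", display_width("한글 test"));
        return 0;
}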

  The other is that somehow the link to 'email list management
information' at the end of a message with a 'list management information'
header does not work. I guess it's easy to fix, but I haven't gotten
around to looking into it yet.


  There may be other problems as well. I'll be glad to hear about them,
although I may not be able to fix them as quickly as I wish to.

  BTW, Pine 4.44 with my patch can also be run under a non-UTF-8 terminal.
In that case, you have to set 'character-set' to the encoding of
your terminal (say, EUC-JP) and define your display filters accordingly.

  My goal was to make Pine a text-terminal version of MS OE or
Mozilla-mail in terms of I18N support. With my patch, Pine got
closer to that goal, but it is still far from it. Some of the features
I want to see include:


  - The encoding(MIME charset) for outgoing emails should be
decoupled from the encoding of a terminal under which Pine
is launched.

  - It should be possible to change the encoding(MIME charset)
of outgoing messages _at the time of_ composition
(as is possible with MS OE and Mozilla-Mail.)
Although going all the way to UTF-8 is desirable,
the reality is that some of my correspondents cannot
deal with UTF-8 messages. For them, I have to
write in legacy encodings. Currently, I  have to
launch another Pine with a separate pinerc to compose
my email in a legacy encoding.

  - The internal encoding conversion (as opposed to relying on
users setting display filters correctly in pinerc) with iconv

  - 'assumed-charset' should  be settable per-folder basis as well as
 globally.


   Hope a lot of people find my patch useful,

   Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: mk_wcwidth

2002-06-21 Thread Jungshik Shin




On Thu, 20 Jun 2002 [EMAIL PROTECTED] wrote:

 You do realize that people in CJK locales expect some characters to be
 double width that people in European/American locales expect to be single
 width.

 Doublewidth roman letters are in the unicode range FF00-FFFE, so
 when converting from a legacy encoding that assumes the ascii
 ranges are all doublewidth, you map to (ascii+FEE0). With

  Well, legacy _encodings_ like EUC-JP/KR, Shift_JIS,
Big5 and GB2312 (should be EUC-CN) include _two_ distinct sets of Latin
letters: one set in US-ASCII (or its national counterpart) and the other
set in JIS X 0208 (EUC-JP), KS X 1001 (EUC-KR), JIS X 0208 (Shift_JIS),
Big5 (Big5), or GB2312-80 (EUC-CN). It's _only the latter_ that has to be
mapped to fullwidth US-ASCII characters in Unicode. Most CJK input
methods, whether in Unix/X11, MS-Windows or MacOS, offer a distinct
way to input fullwidth US-ASCII characters.
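
To make the mapping concrete, a minimal sketch (my own illustration, not
code from any actual converter): the second set corresponds to
U+FF01..U+FF5E, a constant offset of 0xFEE0 from printable US-ASCII, with
the space mapping to U+3000 IDEOGRAPHIC SPACE instead.

/* Map a printable US-ASCII code point to its fullwidth counterpart;
 * anything else is returned unchanged.  Illustration only. */
static unsigned int ascii_to_fullwidth(unsigned int cp)
{
        if (cp >= 0x21 && cp <= 0x7E)
                return cp + 0xFEE0;     /* '!'..'~'  ->  U+FF01..U+FF5E */
        if (cp == 0x20)
                return 0x3000;          /* SPACE     ->  IDEOGRAPHIC SPACE */
        return cp;
}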

 unicode you can even mix double and singlewidth ascii in a
 single document; many of the roman letters became kanji
 when in doublewidth form (for example doublewidth capital
 letter H can mean pornography) and have a different meaning
 than their single-width brethren.

 So a unicode char-cell width function should function identically
 for all locales.

  Not true. Although I'm not among those who like to see Greek and
Cyrillic letters rendered in fullwidth (it's really ugly !!), there ARE
_some_ (I wouldn't say there are many) CJK people who want to keep them
that way.  Moreover, it's not only Greek and Cyrillic letters but also
line-drawing characters that have locale-dependent width. You may as well
read UTR #11/UAX #11 East Asian Width at http://www.unicode.org/reports/tr11/.


 (I dont know of any unicode support for fullwidth greek or cyrillic,
  but should such a thing be needed, there is room north of the BMP)

  There will never be such a thing in Unicode. The only reason the fullwidth
Latin letters are encoded separately in Unicode is that they were already
present in legacy CJK character sets with code points distinct from their
US-ASCII (halfwidth) counterparts. See above.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




less-374 patch (was Re: less 358)

2002-05-20 Thread Jungshik Shin




On Mon, 20 May 2002, Zvi Har'El wrote:

 I am using less on a UTF-8 Redhat Linux 7.3 machine. I am having trouble with
 using man, because overstriking is not handled properly. I read the
 Unicode HOWTO and compiled less (358) with the patch suggested by
 http://mail.nl.linux.org/linux-utf8/2001-05/msg00023.html
 and the situation improved. However it is not completely OK, as you may easily


  I'm afraid the patch you applied introduced the problem you described
while solving the problem of overstriking in UTF-8 mode. BTW, the
patch (as applied by the author of less in less 361) only works for
two-octet-long UTF-8 characters.


 at the beta version of less (377?), but it didn't address this bug at all

 The patch you referred to seems to have been applied in less 361,
according to the version.c file.

  Anyway, attached is a *simplistic* (not perfect) patch against
less 374 (the newest at the less home page) that I've just made, which
apparently solves both issues: overstriking of
three-octet-long UTF-8 characters, and underlining and overstriking of
two identical US-ASCII characters in a row ('ff' in 'troff', 'tt' in
'pattern'). It's not perfect because it only checks the first
octet of a two- or three-octet-long UTF-8 char to see whether it's
identical to the char preceding the backspace.

  I tested it under a UTF-8 xterm and it worked fine with the attached
test case, with 'nroff', U+0411, U+2010, U+AC00 and U+4E00 overstruck and
'pattern' underlined.  Underlining doesn't work for UTF-8 characters (other
than US-ASCII), though. However, this is also the case with less-374
without my patch.

   Hope this helps,

   Jungshik Shin


--- line.c.orig Mon May 20 11:56:34 2002
+++ line.c  Mon May 20 12:53:36 2002
@@ -592,12 +592,19 @@
 	 * or just deletion of the character in the buffer.
 	 */
 	overstrike--;
-	if (utf_mode && curr > 1 && (char)c == linebuf[curr-2])
+	if (utf_mode && c & 0x80 && curr > 2 && (char)c == linebuf[curr-3])
 	{
 		backc();
 		backc();
+		backc();
+		overstrike = 3;
+	} else if (utf_mode && c & 0x80 && curr > 1 && (char)c == linebuf[curr-2])
+	{
+		backc();
+		backc();
+		STORE_CHAR(linebuf[curr], AT_BOLD, pos);
 		overstrike = 2;
-	} else if (utf_mode && curr > 0 && (char)c == linebuf[curr-1])
+	} else if (utf_mode && curr > 0 && c & 0x80 && (char)c == linebuf[curr-1])
 	{
 		backc();
 		STORE_CHAR(linebuf[curr], AT_BOLD, pos);


1. nroff
nnrrooffff
nnrrooffffgg ABCD


2. UTF-8 chars : two octet or three otcte long
ББ
‐‐
가가가abbc
一一一가

3. This does not work !! The first octet of a char. following
backspace is the same as the first octet of a char. preceding
backspace, but the subsequent octet is different so that
backspace should erase the char. before it.

가각가abbc
Бӡ

4. pattern : underlined

_p_a_t_t_e_r_n


5. underlining does not work for UTF-8 chars. 
_‐
_Б
_A_B


6. This is the reverse of the common convention(as used by nroff), 
but it works.

‐_
Б_
가_



Re: utf8-utf16

2002-05-13 Thread Jungshik Shin




On Mon, 13 May 2002, Tay, William wrote:

 How/what can I use to convert utf8 to utf16 (Windows) ?

Check out

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp

  WideChar in Windows is at least UCS-2, if not UTF-16.

 If what you're looking for is a command-line tool, you can use
iconv (under Cygwin, or a native build) and native2ascii (which comes
with the JDK).
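
If a command-line tool is not enough and you need to do it in code, here
is a minimal sketch using the iconv(3) API (works with glibc or GNU
libiconv; error handling is kept to a minimum and the sample string is
just an assumption):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* UTF-16LE is what Windows 'Unicode' text usually means on disk. */
        iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
        if (cd == (iconv_t)-1) {
                perror("iconv_open");
                return 1;
        }

        char in[] = "caf\xc3\xa9";              /* "café" in UTF-8 */
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
                perror("iconv");
                iconv_close(cd);
                return 1;
        }
        fwrite(out, 1, sizeof out - outleft, stdout);   /* raw UTF-16LE octets */
        iconv_close(cd);
        return 0;
}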

 Also is there anyway I can input and store utf8 encoded strings in a Window
 system?

  Notepad(perhaps only under Win NT4 or up?), Vim, Yudit, Mozilla-composer,
SC Unipad(?), Wordpad, MS-Word, etc...

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Switching to UTF-8 and Gnome 1.2.x

2002-05-09 Thread Jungshik Shin


Hi,

In my transition to UTF-8, I found that Gnome 1.2.x has a lot of files
in mixed encodings. All *.desktop files and .directory files are in
mixed encodings. Entries for [ja] are in EUC-JP, entries for [de] are
in ISO-8859-1/15 and entries for [ru] are in KOI8-R and so on. On the
other hand, corresponding KDE files are all in UTF-8 so that I don't
need to change anything there. Anyway, thanks to Encoding module (to be
included in upcoming Perl 5.8 by default), I was able to write a simple
script to add ko_KR.UTF-8 entries for all [ko] entries in EUC-KR
in *desktop files and .directory files. Below is the list of
directories I have to run my script on:

/usr/share/apps
/usr/share/applets
/usr/share/applnk
/etc/X11/applnk
/usr/share/mc
$HOME/.gnome

Still, I got gibberish in the Gnome tip of the day. It turned out that Gnome
hint files (usually installed in /usr/share/gnome/hints) are XML files
in mixed encodings. I don't think they're compliant with the XML standard,
because I've never heard of XML files in mixed encodings. So, I also
had to add ko_KR.UTF-8 entries for all [ko] entries. Even with this,
for some reason unknown to me, whenever I cross the 'boundary' (i.e. go
from the last tip to the first or the other way around), I get gibberish.

Two other places where languages are tied to encodings are
Gnome help (usually in /usr/share/gnome/help) and Gimp tips
(/usr/(local/)share/gimp/$version/tips/gimp_tips.[lang].txt). I also had
to make UTF-8 versions of them.

I believe all these problems have been addressed in Gnome 2.0 (RC?/beta),
but Gnome 1.x is still widely used. I thought my experience would
help others who want to move on to UTF-8, as well as distribution
builders.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Switching to UTF-8

2002-05-06 Thread Jungshik Shin




On Mon, 6 May 2002, Pablo Saratxaga wrote:
 On Mon, May 06, 2002 at 10:11:34AM +0900, Tomohiro KUBOTA wrote:

  Note for xkb experts who don't know Hiragana/Katakana/Hangul:
  input methods of these scripts need backtracking.  For example,
  in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
  v: vowel) sequence.  When I hit c-v-c, it should represent one
  Hangul syllable c-v-c.  However, when I hit the next v, it
  should be two Hangul syllables of c-v c-v.

 That is only the case with 2-mode keyboard; with 3-mode keyboard there
 is no ambiguity, as there are three groups of keys V, C1, C2; allowing
 for all the possible combinations: V-C2, C1-V-C2. Eg: there are two keys

'V-C2, C1-V-C2' should be 'C1-V' and 'C1-V-C2' :-)

To go all the way to Xkb, even the three-set keyboard layout has to be
modified a little, because some clusters of vowels and consonants
are not assigned separate keys but have to be entered by a sequence
of keys assigned to basic/simple vowels and consonants. Alternatively,
programs have to be modified to truly support the 'L+V+T*' model of Hangul
syllables as stipulated in TUS 3.0, p. 53.
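
For reference, the arithmetic behind the L+V+T model is simple. A minimal
sketch of the standard Unicode Hangul composition formula (jamo indices as
defined in TUS; not tied to any particular keyboard layout or input
method):

/* Compose a precomposed Hangul syllable (U+AC00..U+D7A3) from jamo
 * indices: l = leading consonant 0..18, v = vowel 0..20,
 * t = trailing consonant 0..27, where t == 0 means no trailing
 * consonant.  Example: l=0 (KIYEOK), v=0 (A), t=0 gives U+AC00 '가'. */
static unsigned int hangul_compose(int l, int v, int t)
{
        return 0xAC00u + (unsigned int)((l * 21 + v) * 28 + t);
}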


 for each consoun: one for the leading syllab consoun, and one for the
 ending syllab consoun. (I think the small round glyph to fill an empty
 place in a syllab is always at place C2, that is, c-v is always written
 C1-V-C2 with a special C2 that is not written in latin transliteration)

  You almost got it right, except that IEung ('ㅇ') is NULL at the
syllable onset position (i.e. it's a placeholder for syllables that
begin with a vowel and does not appear in Latin transliteration). IEung
is not NULL at the syllable coda position but corresponds to [ng] (IPA:
[ŋ]) as in 'young'. To put it your way, a V-C2 syllable is always written
as IEung-V-C2, with IEung having no phonetic value. Here I assumed
we're not talking about the orthography of the 15th century ;-)

   Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Switching to UTF-8

2002-05-05 Thread Jungshik Shin



On Sun, 5 May 2002, Tomohiro KUBOTA wrote:

 At 02 May 2002 23:54:37 +1000,
 Roger So wrote:

  I _do_ think xkb is sufficient for Japanese though, if you limit
  Japanese to only hiragana and katagana. ;)

 I believe that you are kidding to say about such a limitation.
 Japanese language has much less vowels and consonants than Korean,
 which results in much more homonyms than Korean.  Thus, I think

  Well, actually it's not so much the difference in
the number of consonants and vowels as the fact that Korean has
both closed and open syllables while Japanese has only open syllables
that makes Japanese have a lot more homonyms than Korean.

 native Japanese speakers won't decide to abolish Kanji.

  I don't think the Japanese ever will, either.  However, I'm afraid
having too many homonyms is a little too 'feeble' a 'rationale' for
not being able to convert to all-phonetic scripts like Hiragana and
Katakana. The easiest counter-argument to that is to ask how Japanese
speakers can tell which homonym is meant in oral communication if Kanji
is so important for disambiguating homonyms. They don't have any Kanji to
help them (well, sometimes you may have to write down Kanji to break
the ambiguity in the middle of a conversation, but I guess that's mostly
limited to proper nouns). I heard that they don't have much trouble,
because the context helps a listener a lot in figuring out which
of many homonyms is meant by a speaker. This is true in any language.
Arguably, the same thing could help readers in written communication.
Of course, using logographic/ideographic characters like Kanji certainly
helps readers very much, and that should be a very good reason for the
Japanese to keep Kanji in their writing system.

  The English writing system is also 'logographic' in a sense (so is modern
Korean orthography in pure Hangul, as it departs from strict agreement
between pronunciation and spelling), and a spelling reform (to make
English have a degree of agreement between spelling and pronunciation
similar to that of Spanish) would make written text harder to read,
depriving English written text of its 'logographic' nature. On the
other hand, it would help learners and writers. It's always been a struggle
between readers and writers, and between listeners and speakers.

 xkb can be used.  However, more than half of Japanese computer
 users use Romaji-kana conversion, two-keys-one-hiragana/katakana
 method.  The complexity of the algorithm is like two or three-key
 input method of Hangul, I think.  Do you think such an algorithm
 can be implemented as xkb?  If yes, I think Romaji-kana conversion
 (whose complexity is like Hangul input method) can be implemented
 as xkb.

  I'd also like to know whether it's possible with Xkb.  BTW, if
we use three-set keyboards (where leading consonants and trailing
consonants are assigned separate keys) and use U+1100 Hangul Conjoining
Jamos, Korean Hangul input is entirely possible with Xkb alone.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




LC_PAPER vs /etc/papersize (was..Re: Please do not use en_US.UTF-8..)

2002-05-01 Thread Jungshik Shin




On Tue, 30 Apr 2002, David Starner wrote:

 On Tue, Apr 30, 2002 at 11:09:55PM -0400, Jungshik Shin wrote:
  However, to me overiding the default at the command line is a perfectly
  good solution.

 Everytime you use a program?
 Stuff like that gets real tiring, real fast
 to me.

  What are shell scripts/aliases for ;-) ? What if your site has
multiple printers with different sizes of paper loaded by default?
How about printers with multiple trays?  Whichever method you use to
set the default, you have to use a command-line option or other means
to override the default. However, I have to admit that you clearly have
a point.  It's not the most desirable thing for programs to derive the
default paper size from the locale *name* assigned to LC_PAPER. It's
certainly true that if programs rely on /etc/papersize instead of mapping
the locale *name* to the default paper size, it's easier to change the
default paper size.

 What has to be done is to use the actual *value* stored in LC_PAPER
instead of 'guessing' the default paper size from the locale *name*,
provided that LC_PAPER is a standard locale category. It's not, yet.

  I was wrong to say that LC_PAPER is defined in ISO 14652
(draft).  It's not there. SUS V3 doesn't have it, either. So,
it's not a standard locale category, but at least it's available
where glibc 2.2.x is used (i.e. all Linux distributions,
including Debian). Even there, nl_langinfo(PAPER_HEIGHT) and
nl_langinfo(PAPER_WIDTH) don't work yet. langinfo.h in glibc 2.2.x has
_NL_PAPER_HEIGHT and _NL_PAPER_WIDTH, so programmers might
use nl_langinfo(_NL_PAPER_WIDTH) and nl_langinfo(_NL_PAPER_HEIGHT).
However, that's not very portable (either across platforms or over
time), because I believe the '_' at the beginning of _NL_PAPER_* indicates
their non-standard nature.  Now, what follows is based not on what is widely
available (or standard) but on what may be in the future.
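
For illustration, this is roughly how a program could read those values on
glibc today (non-portable, as said above; the union trick reflects the
fact that nl_langinfo returns char * even for these numeric items, and the
helper name is made up):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* glibc-specific: _NL_PAPER_WIDTH/_NL_PAPER_HEIGHT are numeric locale
 * items smuggled through nl_langinfo's char * return value. */
static unsigned int paper_item(nl_item item)
{
        union { char *s; unsigned int w; } u;
        u.s = nl_langinfo(item);
        return u.w;
}

int main(void)
{
        setlocale(LC_ALL, "");          /* or set LC_PAPER alone */
        printf("default paper: %u x %u mm\n",
               paper_item(_NL_PAPER_WIDTH), paper_item(_NL_PAPER_HEIGHT));
        return 0;
}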

[hypothetical situation]
  How often do you (or people in general) use a paper size other than US
letter (or A4 outside the US)? If the answer is most of the time, you can
build your own locale with LC_PAPER defined for the most frequently used
paper size at your site (say, en_US.UTF-8@legal). Then, you can have

  LC_PAPER=en_US.UTF-8@legal
  LANG=en_US.UTF-8

And a French living in the US may have

  LC_PAPER=en_US.UTF-8@legal
  LANG=fr_FR.UTF-8

  What difference is there between setting /etc/papersize and building
and installing a new locale for your favorite size? Sure, editing one line
is easier than building a new locale. However, it's not as flexible as
you think.  With en_US.UTF-8@legal built and installed, different users
with different choices of default paper size (because their offices
have different printers with the primary tray holding different paper sizes)
can happily *share* a *single* system. They don't have to fight over
which paper size goes into /etc/papersize.  Those who mainly use US letter
can just set LANG to en_US.UTF-8 and leave LC_PAPER alone (or they can
set it to en_US.UTF-8 if they want to). Others who mainly use legal
paper can set LC_PAPER to en_US.UTF-8@legal with LANG set to en_US.UTF-8.
[end hypothetical situation]



   Jungshik Shin


(1) The LC_PAPER definition for US letter goes like this (the unit is mm):
LC_PAPER
height   279
width216
END LC_PAPER

You can change height and width to whatever value you want.


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Please do not use en_US.UTF-8 outside the US

2002-05-01 Thread Jungshik Shin

On Thu, 2 May 2002, Keld Jørn Simonsen wrote:

 The nice thing about LC_PAPER is that it is set either on installation,
 or as part of the normal setup. I think most people know how to set the
 locale, while some, maybe many, would not know that there is a
 /etc/papersize file.

  Yes, I've been bitten more than once by these 'hidden' files
lurking around in /etc that affect the way programs work.

 LC_PAPER was in 14652 at some time but was taken out, because some
 people thought that it was not useful :-(

  So, my memory was not telling me a lie. I was almost sure I had
seen it in ISO 14652 when I wrote that LC_PAPER is in ISO 14652.
Later, when I checked, it wasn't there, which led me to believe
that my memory had failed me once more.

  Anyway, what's the plan of ISO/IEC JTC1/SC22/WG20 on this?

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



