from:"Jungshik Shin"

Re: relevance of "[PATCH] tty utf8 mode" in linux-kernel 2.6.4-rc1

2004-03-01 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

> From: Jungshik Shin <[EMAIL PROTECTED]>
>
> >  Sure, every one of Korean emulators (for EUC-KR and Johab) I have used
> > moves two column-widths (a single Korean character) for 'backspace'.
> > I was rather surprised to know that Japanese terminal emulators don't.
>
> Really?
> Please note that the current topic is the behavior on sending a 0x08 to
> the terminal, not pushing Backspace key.  It is apparent that pushing
> Backspace key should erase one character (not one byte nor one column).

Oops. Sorry I was mistaken. In that case, you're probably right although
I can't be sure, for the obvious reason, that _every_ CJK terminal (ever
made) behaves that way.
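
To make the distinction concrete, here is a minimal Python sketch of where the cursor lands after a single 0x08 under the two conventions discussed. The helper names are hypothetical, and the width rule (East Asian Width W/F counts as two columns) is only an approximation of what wcwidth() reports in a real terminal.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate terminal columns for one character (assumption: W/F = 2)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def column_after_bs(line: str, per_character: bool) -> int:
    """Cursor column after one 0x08, starting just past the end of `line`."""
    if not line:
        return 0
    column = sum(cell_width(ch) for ch in line)
    # Either step back over the whole last character or over a single column.
    step = cell_width(line[-1]) if per_character else 1
    return column - step


line = "a한"  # one narrow character followed by one double-width character
print(column_after_bs(line, per_character=True))   # back by a whole character
print(column_after_bs(line, per_character=False))  # back by a single column
```

With per_character=False the cursor ends up in the middle of the double-width character, which is the behavior at issue in this thread.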

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/

Re: relevance of "[PATCH] tty utf8 mode" in linux-kernel 2.6.4-rc1

2004-03-01 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

> From: Markus Kuhn <[EMAIL PROTECTED]>
>
> > to the left, not one *cell*. I know that this is not what backspace does
> > in some EUC terminal emulators, but I believe a strong case can be made
>
> A correction.  Not *some* EUC terminal emulators, but *every* EUC
> terminal emulator.  Do you know *any* example which is popular
> in CJK world and on which a 0x08 moves two columns on a doublewidth
> character?

 Sure, every one of the Korean emulators (for EUC-KR and Johab) I have used
moves two column-widths (a single Korean character) for 'backspace'.
I was rather surprised to know that Japanese terminal emulators don't.

Jungshik


Re: Canonical Mode Input Processing with multi-byte character sets

2004-02-24 Thread Jungshik Shin

On Tue, 24 Feb 2004, Derek Martin wrote:

Hi Derek,

> On Tue, Feb 24, 2004 at 08:43:09PM +0900, Jungshik Shin wrote:
> >   Please, read what I wrote more carefully. I did write that deleting
> > the last letter is more useful when you're in the middle of typing a
> > sequence of letters to form a syllable.
>
> I think we're talking past each other here...  I noted that and I agree
> with it.  It's specifically the fact that once I type the third
> character of a hangeul glyph, I can't backspace and change ONLY that
> last character, that annoys me.  You say that most Koreans prefer that
> behavior, and I believe you.  But I can't for the life of me
> understand why...  ;-)  To me, it seems unnatural and inefficient.

 Sorry for my misunderstanding. As you may know by now, the Korean script
has several different facets. It's alphabetic, syllabic and featural
all at the same time. Therefore, different implementations at different
times on different platforms take different approaches when it comes to
representing and processing the Korean script on computers. Because you
live in Korea now, you must have seen the keypad of Korean mobile phones
and may have learned how to type Korean on it.  It uses three keys for
vowels and six keys for consonants. See how the consonants are grouped and
you may understand why the Korean script is featural.

> Almost invariably once I've committed an erroneous syllable, it's not
> the whole syllable I need to replace, but only the last character
> which I flubbed.  Otherwise, if I made a mistake before the syllable

  Anyway, I understand where you're coming from. Your complaint
is perfectly valid. What you want can and must be implemented. Actually,
Nabi may already have implemented it because its input automaton is based
on the Hangul Jamo block (U+1100 and up). In addition, I have the same
complaint about the most popular Korean mobile phone keypad. It takes a
lot more keystrokes to enter a single syllable, and it's annoying to find
that 'backspace' deletes the whole syllable instead of the last letter
typed. However, 9th graders on the street don't seem to have a problem at
all because they can type Korean so fast with the keypad that having to
enter a syllable from the beginning doesn't appear to matter to them.
So, I guess your problem will go away as you get more familiar with your
Korean keyboard and input method.
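
A minimal Python sketch of the "delete only the last letter" behavior, assuming the committed syllable is decomposed into conjoining jamo (the U+1100 block) with Unicode normalization. The function name is hypothetical and this is not how Nabi or Ami is implemented. Note also that a compound final consonant is a single jamo in this decomposition, so one "letter" here can correspond to two keystrokes.

```python
import unicodedata


def delete_last_letter(syllable: str) -> str:
    """Drop the last jamo of a precomposed Hangul syllable and recompose."""
    jamo = unicodedata.normalize("NFD", syllable)   # e.g. 3 jamo for a CVC syllable
    return unicodedata.normalize("NFC", jamo[:-1])  # recompose what is left


print(delete_last_letter("한"))  # -> 하 (the final consonant is removed)
```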

> > However, incremental search needs to be done with individual letters
> > as unit instead of syllables. I think Indian people have similar
> > needs.

  Incremental search with letters as units was implemented
in only one program (a Korean Emacs: Hanemacs by KIM Kang-hee) as far
as I know.  It would be great if it were implemented in Mozilla's 'find as
you type'.
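
As an illustration of letter-level incremental search, here is a minimal Python sketch that compares the jamo decompositions of the typed text and the target text instead of the precomposed syllables. The function names are hypothetical; this is not taken from Hanemacs or Mozilla, and it ignores the fact that leading and trailing consonants are distinct jamo code points.

```python
import unicodedata


def to_letters(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamo."""
    return unicodedata.normalize("NFD", text)


def letter_search(typed: str, text: str) -> bool:
    """True if the letters typed so far occur, in order, inside `text`."""
    return to_letters(typed) in to_letters(text)


# A half-typed syllable matches at the letter level but not at the
# syllable level.
print(letter_search("하", "한글"))  # letter-level comparison
print("하" in "한글")               # syllable-level comparison
```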

> > LANG=en_US.UTF-8  (or en_GB.UTF-8, en_CA.UTF-8)
> > LC_CTYPE=ko_KR.UTF-8
> > LC_MESSAGES=en_US.UTF-8 # not necessary unless LC_ALL is set, but
> > LC_TIME=en_US.UTF-8 # just to be sure.
> > ---
>
>
>   # .profile (or whatever)
>   LANG=en_US.UTF-8
>   LC_COLLATE=C  # I like ASCII sorting for most applications...
>   ...
>   export LANG LC_COLLATE ...
>
> Then, when I start up an application where I want to type Korean, I
> originally tried starting it like this:
>
>   $ LANG=ko_KR.UTF-8 LC_COLLATE=ko_KR.UTF-8 LC_MESSAGES=en_US.UTF-8 gedit&

> 2. Hangeul input via ami simply didn't work.

  There's one missing piece here. Sorry, I forgot to tell you. You have
to set XMODIFIERS to '@im=Ami'. If you log on with the Korean locale
selected in KDM/GDM, this variable is automatically set for you on
most Linux distributions. However, apparently you don't, so you have
to set it manually.

> 1. Menus were in Korean

  Really? Hmm, you may have 'LANGUAGE' or something like
that (a non-standard GNU extension) set to Korean. Make sure it's unset.

> As it happens, until recently the most common case I want to do this
> was with mozilla.  It wasn't a major problem then, because my
> installation of Mozilla had no Korean.  But as my Korean improves, I
> have more and more cases where I want to do this.  Of course, I'm also
> better able to navigate the menus, but that's beside the point...  :)

  Actually, Mozilla language packs work independently of the locale. No
matter what your locale is, you can have Mozilla's menu in any
language for which you have installed the language pack.  However,
Ami works with Mozilla only if Mozilla is launched with LC_CTYPE (or
equivalent) set to ko_KR.UTF-8/ko_KR.EUC-KR. BTW, it should be fixed
to work with any UTF-8 locales. Hmm, I'm gonna add it to the TODO list.

  Jungshik


Re: Canonical Mode Input Processing with multi-byte character sets

2004-02-24 Thread Jungshik Shin

On Tue, 24 Feb 2004, Derek Martin wrote:

> On Tue, Feb 24, 2004 at 03:51:22PM +0900, Jungshik Shin wrote:
>
> >   Even worse yet, it depends on when, who and where. If a 'grapheme'
> > (e.g. a 'syllable' in Indic scripts, Korean script) is being formed when
> > 'backspace' is entered, it's desirable to erase just one combining
> > character. For 'committed' graphemes, one wants to erase the whole
> > character sequence making up a grapheme.
> >
>
> FWIW, I actually disagree with that.  Personally, I find that I only
> want to erase the last character of the syllable far more often than I

  Please, read what I wrote more carefully. I did write that deleting
the last letter is more useful when you're in the middle of typing a
sequence of letters to form a syllable. Once a syllable is committed into
the backing store, however, most Korean people want the cursor movement,
the selection and editing operations like deletion/insertion to be done
syllable by syllable.  However, incremental search needs to be done with
individual letters as the unit instead of syllables. I think Indian people
have similar needs.

  These behaviors are default with XIM servers for Korean like 'Ami'
(http://kldp.net/projects/ami ) or 'Nabi' (http://kldp.net/projects/nabi).

> blunt, I find that really annoying, and if there's a way to change
> that behavior, I certainly would like to know how...

  What input method server do you use? The msg strings for Ami are available
in English, too.  That is, setting LC_MESSAGES to en_US.UTF-8 gives you
English menus in Ami.

> can't see how...  Perhaps my biggest problem is that I can't find any
> documentation about using Korean with Linux which isn't written in
> Korean.  Which is all well and good, if you already happen to speak
> Korean fluently...  ;-)

  I used to post 'Hangul and Internet in Korea FAQ' to
soc.culture.korean regularly, but that's way too outdated by now.
Please feel free to ask me off-line if you have any problem.

> >   You're probably  right that issues above had better be dealt with
> > 'user-land' input methods/daemon/whatever if possible. But, then,
> > for characters that have been committed (not in pre-editing stage),
> > 'user-land' input methods can't do much.  Terminal emulators? ...
>
> It seems like a perfectly viable solution.  But I can't help but think
> that it would be better if the kernel allowed for language-specific
> IME modules in the console/tty drivers.  Then you could deal with it
> uniformly at all levels of input management...  One API to enter
> characters, whether you're typing in a terminal emulator or at the
> console.  What I'm essentially envisioning is that all input

  It's not for kernel, but you may find it interesting to know more
about IIIMF and SCIM. http://www.openi18n.org/subgroups/im/IIIMF/

> about the right way to be able to enter hangeul, while still
> maintaining English menus and messages and such.  So far, my research
> has turned up precious little, and I have only been able to type in

  Well, it's easy. I always do that because I don't like the quality of
Korean translation in most software, commercial or open-source.  Add
this to your ~/.i18n (or equivalent. ~/.profile )

---
# LANG may equally be en_GB.UTF-8 or en_CA.UTF-8
LANG=en_US.UTF-8
LC_CTYPE=ko_KR.UTF-8
# The next two are redundant while LANG is an English locale, but set
# them just to be sure.
LC_MESSAGES=en_US.UTF-8
LC_TIME=en_US.UTF-8
---

If you add them to your .profile, don't forget to export them.

---
unset LC_ALL  # just in case LC_ALL is set somewhere else
export LANG LC_CTYPE LC_MESSAGES LC_TIME
---

 BTW, you don't need to read Korean to figure out the above yourself :-)
because the relevant information is available in any good POSIX document.
On Linux, try 'man setlocale' and related man pages.
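
For readers who want the precedence spelled out, here is a minimal Python sketch of the POSIX rule for picking the value of one locale category: LC_ALL wins if set, then the category's own variable, then LANG. The function name is hypothetical, the fallback to "C" is an assumption (the default is implementation-defined), and GNU-specific variables are not modelled.

```python
def effective_locale(category: str, env: dict) -> str:
    """Resolve one locale category from environment-style variables."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return value
    return "C"  # assumption: fall back to the "C" locale


env = {
    "LANG": "en_US.UTF-8",
    "LC_CTYPE": "ko_KR.UTF-8",
}
print(effective_locale("LC_CTYPE", env))     # -> ko_KR.UTF-8
print(effective_locale("LC_MESSAGES", env))  # -> en_US.UTF-8 (from LANG)
```

This also shows why LC_ALL should be unset: if it is present, the per-category settings are never consulted.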

  Jungshik


Re: Canonical Mode Input Processing with multi-byte character sets

2004-02-23 Thread Jungshik Shin

On Mon, 23 Feb 2004, Henry Spencer wrote:

> On Thu, 19 Feb 2004, Markus Kuhn wrote:
> > [Things get even more tricky with the available experimental terminal
> > support (e.g., in XFree86's xterm) for combining characters such as
> > diacritical marks, which are characters with wcwidth()=0...
>
> Worse yet, when combining characters are being entered separately, one
> might wish that backspace erase only the latest combining character, not
> the whole sequence back to and including the base character.

  Even worse yet, it depends on when, who and where. If a 'grapheme'
(e.g. a 'syllable' in Indic scripts, Korean script) is being formed when
'backspace' is entered, it's desirable to erase just one combining
character. For 'committed' graphemes, one wants to erase the whole
character sequence making up a grapheme.
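
The two behaviors can be sketched in a few lines of Python. This is a rough approximation under stated assumptions: a "grapheme" is taken to be a base character plus any following characters with a non-zero canonical combining class, which is much cruder than real grapheme cluster segmentation (many Indic vowel signs and Hangul jamo have combining class 0). The function name is hypothetical.

```python
import unicodedata


def erase(buffer: str, committed: bool) -> str:
    """Backspace over one combining character or over a whole sequence."""
    if not buffer:
        return buffer
    if not committed:
        # Still composing: remove only the most recently entered character.
        return buffer[:-1]
    # Committed: skip back over combining marks to the base character.
    end = len(buffer) - 1
    while end > 0 and unicodedata.combining(buffer[end]):
        end -= 1
    return buffer[:end]


text = "ae\u0301\u0323"  # 'a', then 'e' with two combining marks
print(len(erase(text, committed=False)))  # -> 3 (one mark removed)
print(len(erase(text, committed=True)))   # -> 1 (whole sequence removed)
```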

> Personally, I suspect that the best answer at this point is to concede
> that the kernel device drivers live permanently in the world of 8-bit
> character sets, and that functionality such as Unicode input editing
> belongs in a user-level daemon rather than in the kernel.  The vast
> majority of user keyboard input already passes through at least one such
> daemon anyway, so there is no significant efficiency issue any more.

  You're probably right that the issues above are better dealt with by
'user-land' input methods/daemons/whatever if possible. But, then,
for characters that have been committed (not in the pre-editing stage),
'user-land' input methods can't do much.  Terminal emulators? ...

  Jungshik


Re: Perl & unicode weirdness.

2004-02-04 Thread Jungshik Shin

Glenn Maynard wrote:

> On Mon, Feb 02, 2004 at 12:21:40PM -0800, Larry Wall wrote:
>
> > locales for everyone willy nilly.  So 5.8.1 backed off on that, with
> > the result that you have to be a little more intentional about your
> > input formats (or set the PERL_UNICODE environment variable).
>
> What's the normal way to say "use the locale, like every other Unix
> program that processes text"?  Setting PERL_UNICODE seems to make it
> *always* use Unicode:

 Another way to say that is to use the '-C' option, whose meaning changed
between 5.8.0 and 5.8.1.

> (It's a shame that Perl doesn't behave like everyone else and obey
> locale settings correctly; I thought we were finally getting away
> from having to tell each program individually to use UTF-8.  I don't
> understand the logic of "RedHat set the locale to UTF-8 prematurely,
> so Perl shouldn't obey the locale".)

 I tend to agree with you, but not entirely. There are many cases where
following the locale doesn't work. See the thread in the Perl-unicode list
on the topic:

http://www.nntp.perl.org/group/perl.unicode/2243

(I couldn't find a threaded-view option, but articles #2243 through #2286
are all about this issue, so you can just keep pressing the 'next' button.)

Jungshik




Re: Does Hotmail support UTF-8 emails properly?

2004-02-01 Thread Jungshik Shin

Richard Jones wrote:

> On Sun, Feb 01, 2004 at 05:35:04AM +0900, Jungshik Shin wrote:
>
> > ASCII  are compatible).  For your mail-sending web form, why don't you
> > send an email to yourself and view it with mail clients that are well
> > I18Nized such as Mozilla-Mail, Mozilla Thunderbird and  MS Outlook Express?
>
> Unfortunately Hotmail is what the majority of the target audience use.
> I've now changed the script so that it uses iconv to convert
> everything to ISO-2022-JP before sending, and now it works OK in
> Hotmail.

 That's unfortunate, indeed. However, it's not that bad if your
recipients are all Japanese and they don't need to receive non-Japanese
emails. BTW, I mentioned Mozilla/MS OE as a way to make sure that your
mail-sending form works correctly because you were not sure that it
worked correctly.
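
A minimal Python sketch of the trade-off, using Python's built-in iso2022_jp codec in place of the iconv call in the original script (the function name is hypothetical). Japanese text converts to a 7-bit byte stream and round-trips, while text outside the ISO-2022-JP repertoire, such as Hangul, cannot be converted at all.

```python
def to_iso2022jp(text: str) -> bytes:
    """Convert text to ISO-2022-JP; raises UnicodeEncodeError if impossible."""
    return text.encode("iso2022_jp")


body = to_iso2022jp("日本語")
print(all(byte < 0x80 for byte in body))  # the encoding is 7-bit clean
print(body.decode("iso2022_jp"))          # round-trips to the original text

try:
    to_iso2022jp("한글")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```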

Jungshik


Re: Linux console UTF-8 by default

2004-01-10 Thread Jungshik Shin

Edward H. Trager wrote:

> On Saturday 2004.01.10 20:48:31 +0330, Roozbeh Pournader wrote:
>
> > On Sat, 2004-01-10 at 20:36, Edward H. Trager wrote:
> >
> > > Is there any good reason why implementors would not support the
> > > full range of Unicode -- i.e., UTF-8 up to six serialized bytes?
> >
> > UTF-8 up to four bytes, you mean. See
> > .
>
> I guess I was recalling (from http://www.cl.cam.ac.uk/~mgk25/unicode.html)
> that six bytes allows encoding all possible 2^31 UCS code points, although
> I suppose nothing above plane 1 has been defined.  - Ed Trager

Plane 2 has tens of thousands of Chinese characters and Plane 14 has
variation selectors and language tags. However, nothing will ever be
defined above Plane 16. JTC1/SC2/WG2 made a firm commitment to that.
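
The arithmetic behind "four bytes versus six" can be checked with a short Python sketch (the function names are hypothetical). It computes the byte length under the original UTF-8 scheme that covered all 2^31 UCS code points, and the plane number of a code point. Since the highest code point of Plane 16 is U+10FFFF, four bytes are enough for everything that will ever be assigned.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed under the original (up to six-byte) UTF-8 scheme."""
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point does not fit in 31 bits")


def plane(code_point: int) -> int:
    """Each plane holds 65536 code points."""
    return code_point >> 16


print(plane(0x10FFFF), utf8_length(0x10FFFF))  # -> 16 4
print(utf8_length(0x7FFFFFFF))                 # -> 6
```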

Jungshik




Re: devanagari question

2004-01-02 Thread Jungshik Shin

On Sat, 3 Jan 2004, Jungshik Shin wrote:

> On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:
>
> > > If you yearn for the old days
> >
> > You seem to have a very slow mind.
>
>   I don't know whose mind is slow. I gave all the necessary information
> and you still couldn't make it work. Here's one more try with a

  I'm sorry, I forgot that I had always built Mozilla with a patch
that went into the trunk only a few days ago. That patch was made such a
long time ago (and it's only necessary for Devanagari but not for Tamil)
that I took it for granted, but it was not in the tree until
a few days ago. You only need to apply the patch if you download the
1.6b release source instead of the CVS trunk source. It is available
at http://bugzilla.mozilla.org/show_bug.cgi?id=203406
(the last patch uploaded there).

  BTW, X11core build doesn't need this patch to work although with the
patch, it works better.

  Jungshik

Re: devanagari question

2004-01-02 Thread Jungshik Shin

On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

> > If you yearn for the old days
>
> You seem to have a very slow mind.

  I don't know whose mind is slow. I gave all the necessary information
and you still couldn't make it work. Here's one more try with
step-by-step instructions (actually, there's not much to tell you because
you must have taken most of these steps).

 1. download Sun Indic fonts, which you already did.

 2. Put them (there are two of them) into a directory of your choice
(say, /usr/local/share/fonts), which you must have done already.

 3. Edit /etc/fonts/local.conf or $HOME/.fonts.conf
and add the directory above to the font search path.

You can skip this step if you throw fonts into
one of directories or its subdirectory already listed in
/etc/fonts/fonts.conf, /etc/fonts/local.conf and  $HOME/.fonts.conf
like /usr/share/fonts or /usr/share/fonts/indic

 3b. although not necessary (because fontconfig
 scans font directories regularly), run the following, if you
 want to make sure.

   fc-cache -v -f 

 4. Launch Mozilla (built with CTL and Xft) and enjoy.  Your web page
was written in such a way that no further configuration is necessary
on Mozilla's side.

 5. _Optionally_, go to font pref. panel of Mozilla and set Devanagari fonts to
Sun's fonts. Also make sure 'allow documents to use other fonts'
is NOT checked. This is necessary for viewing other Hindi pages.
Because most other Hindi sites don't specify 'lang=hi' [1], you have
to launch Mozilla under hi_IN locale (i.e.
'LC_ALL=hi_IN.UTF-8 mozilla') [2]

For an X11core build (with CTL but NOT with Xft), you have to follow the
steps (which can be simplified slightly with chkfontpath, available on
FC1/RH/Mandrake) described at the following URL (or equivalent):
http://bugzilla.mozilla.org/show_bug.cgi?id=176315#c14

(The last two fields of XLFD for Sun Indic fonts should be
'sun.unicode.india-0' instead of  'hykoreanjamo-1'). See also

   http://bugs.xfree86.org/show_bug.cgi?id=939

With the encoding file for Sun Indic fonts, you don't need
to make aliases.

If you want to use 'standard' opentype fonts for Devanagari, you can
try the latest (but still old/outdated) patch
at http://bugzilla.mozilla.org/show_bug.cgi?id=215219

[1] BBC Hindi site will begin to use 'lang=hi' in a couple of weeks.
[2] You don't have to once Mozilla bug 208479 is fixed.

Re: devanagari question

2003-12-31 Thread Jungshik Shin

On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

> Good. So no need to worry about the html page.

 Actually, there is. By 'sun_devanagari_font', I didn't mean that
you should use that verbatim but that you have to replace that name with
the actual name of Sun's font. Besides, it's always good practice to put
one of the five CSS generic font families (serif, sans-serif, etc.) at the
end of your font list, as I wrote.

> Remains to worry about Mozilla and/or the X server and/or fontconfig.

  The X server plays only a small part in the equation as long as it
supports the Render extension. Did you put your Sun Saraswati fonts (two
of them) in one of the directories searched by fontconfig?

> things work. Am quite prepared to use cryptic names like
> -altsys-saraswati5-medium-r-normal--0-0-0-0-p-0-iso10646-1

  Well, with that XLFD name, Mozilla (X11core build) wouldn't
recognize it as a SunIndic font, so Devanagari wouldn't get rendered
as it should. You have to alias it so that the last two fields of the XLFD
are sun.unicode.india-0 (or something like that) by editing the fonts.alias
file, along with some other chores involved in X11 font installation.
That's one of the reasons I told you to use an Xft build.

> but you seem to imply that life is simpler today. Not yet for me.

  If you yearn for the old days of XLFD, X11core fonts and
mkfontdir/mkfontscale/xset fp/chkfont/xfs/fonts.dir/fonts.alias/
fonts.scale etc, you can stay there by continuing to use a non-Xft
(X11core) build of Mozilla. However, for the increasing number of programs
in modern Linux distributions, you won't have a choice soon when gtk2
stops honoring GDK_USE_XFT=0.

> [Answering my own question from yesterday night - the new Mozilla build
> shows as possible font choices things in the output of fc-list on the
> client.]

Where have you been during the client-side font revolution? On Mars ;-) ?

Re: devanagari question

2003-12-30 Thread Jungshik Shin

On Wed, 31 Dec 2003 [EMAIL PROTECTED] wrote:

> [Installed Fedora 1 on a spare machine - compiled Mozilla 1.6b
> after ./configure --enable-ctl --enable-xft . It runs fine (*), but
> doesnt show what I expect to see.]
>
> Let me repeat my question, this time referring to
> http://homepages.cwi.nl/~aeb/moz/test.html

It works fine on my machine with SunIndic truetype fonts installed.
The string there is rendered exactly like the image below.

> [Apart from the obvious Mozilla bugs, there is a change in behaviour.
> The old build showed in Edit/preferences/appearance/fonts actual font
> names, the new build shows font family names. The font names were
> very recognizable: just the output of xlsfonts. These font family
> names have an origin unclear to me. Mozilla does not run on the
> X server, but the X server has the fonts, maybe there is a problem there?]

Not at all.  As I explained at least twice on this list, there are
two flavors of Mozilla builds, the X11core build and the Xft (client-side
font) build. The latter does NOT use the 20-year-old (broken) XLFD-based
font selection scheme any more. Font selection in the Xft build works more
like that on Windows and Mac OS (and is more in line with CSS). You don't
think end-users should have to care about seeing all those (cryptic to them)
'iso8859-1', 'iso10646-1', 'jis0208.1980-0' and things like that, do you?

Jungshik

Re: devanagari question

2003-12-29 Thread Jungshik Shin

[EMAIL PROTECTED] wrote:
> Jungshik wrote:
> 
> 
> Thanks !

You're welcome.

> However, I will not pursue this further. Have no time.
> For the time being it seems this is something where Internet Explorer
> works, and Mozilla still requires a nontrivial amount of work.

  There are certainly a lot of things to do, but that doesn't mean
that it doesn't work.

  On Windows 2k/XP, the _default_ Mozilla build works almost as well
as MS IE for complex scripts (except for rendering justified text
and cursor movement/selection). On Unix/Linux and Win 9x/ME,
you need a CTL-enabled build and the right font.

> (Posted to mozilla-build or so. Awaiting moderator approval.

 If you had used the newsserver (news.mozilla.org) instead of
the mailing list, it'd have been just posted without approval.


Re: devanagari question

2003-12-29 Thread Jungshik Shin

On Sun, 28 Dec 2003 [EMAIL PROTECTED] wrote:

> [A week or so ago I wrote a multilingual text, and several
> languages failed under default Mozilla. If we succeed in
> getting a version that handles devanagari then a next point

You have to make sure to tag the Devanagari part with 'lang="hi-IN"'
for html and 'xml:lang="hi-IN" lang="hi-IN"' for xhtml (if it's Hindi).
That is, you have to do something like this for XHTML.

...

...

...

...

You may also 'style' Devanagari parts with the following style:

font-family: sun_devanagari_font,
 default_devanagari_font_on_Windows,
 default_devanagari_font_on_Mac,
 some_free_devanagari_opentype_fonts,
 generic_css_family

The reason you have to put 'sun_devanagari_font' at the beginning
is that 'sun_devanagari_font' is not likely to be installed
on most Windows/Mac OS X systems, so it doesn't do any harm there,
while for Mozilla-Linux it's essential that it's picked up
_before_ other Devanagari fonts likely to be installed on Linux.

Certainly, things should be easier than this, but that's where Mozilla
stands at the moment.

> for discussion will be vocalized Hebrew. For now the first

  It's not likely to work yet because vocalized Hebrew involves
combining marks (right?).

  Jungshik

Re: devanagari question

2003-12-29 Thread Jungshik Shin

On Sun, 28 Dec 2003 [EMAIL PROTECTED] wrote:

> but I tried compiling on a Debian (Woody) and on a RedHat (7.2) machine.
> In both cases Mozilla-1.6b.
>
> For Debian the compiled binary does not run. Errors are like reported:
>  ./mozilla-bin: relocation error:
>  mozilla/dist/bin/components/libgfx_gtk.so: undefined symbol:
>  GetContent__C8nsIFrame

  Obviously, I can't possibly know what's wrong with your Debian
build environment (linker, compiler, etc) :-) Why don't you post to
netscape.public.mozilla.unix newsgroup at news.mozilla.org with
details including the output of 'nm'?

> For RedHat the version compiled with --enable-ctl runs, but still
> does not handle devanagari.

 Did you install Sun's fonts? In case it's not clear from my post and
the i18n release notes, it only works with the Sun fonts I mentioned.
Although there's a way to make it work with a non-Xft build (I won't
explain it here), I'd recommend you build with 'enable-xft'.

> [On the other hand, adding "--enable-xft" fails (on Debian):
>  checking for xft... Package xft was not found in the pkg-config search path.

  Your Debian seems pretty much outdated as far as Xft/fontconfig is
concerned.

  Jungshik

Re: devanagari question

2003-12-25 Thread Jungshik Shin

On Wed, 24 Dec 2003, Jan Willem Stumpel wrote:

> It would be nice if solutions to common problems (in this case
> 'how to put an UTF-8 string on to the screen', solved, e.g., by
> Openoffice) were shared between different open-source projects.

 OpenOffice uses ICU's layout engine, which supports some complex
scripts but not all of them. In the case of AbiWord, I don't know
anything about its internals, but ICU and Pango (http://www.pango.org)
are two obvious choices (both are open source) if its developers want
to support complex scripts (Brahmi-derived scripts such as Devanagari,
Tamil, Telugu, Thai, Lao, Khmer and Tibetan, as well as Korean Hangul and
Mongolian). Does it support scripts that require BIDI/RTL (Hebrew, Syriac
and Arabic among others)? Also, note that even the Latin, Greek and
Cyrillic alphabets are complex once you go beyond the basics because some
languages need base letter + combining diacritic marks for which there's
no precomposed form in Unicode.
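
The last point is easy to check with a minimal Python sketch: normalize a base letter plus a combining mark to NFC and see whether the result collapses to a single code point. The function name is hypothetical, and the test only reflects what NFC composes, which is a reasonable proxy for "has a precomposed form".

```python
import unicodedata


def has_precomposed(base: str, mark: str) -> bool:
    """True if NFC can fold base + combining mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


print(has_precomposed("e", "\u0301"))  # e + COMBINING ACUTE ACCENT -> True
print(has_precomposed("q", "\u0303"))  # q + COMBINING TILDE -> False
```

In the second case the renderer has to position the mark over the base letter itself, which is exactly the kind of work a layout engine does for complex scripts.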

  Jungshik
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/

Re: devanagari question

2003-12-25 Thread Jungshik Shin

On Tue, 23 Dec 2003 [EMAIL PROTECTED] wrote:

> Recently I noticed that for me the sequence U+092C U+093F (b i)
> is rendered by Mozilla as b followed by i, while in fact the i glyph
> should precede the b glyph.
>
> Is something wrong in my expectations? or in Mozilla? or in my
> Mozilla 1.5 setup?

 Devanagari is not supported by the default Mozilla build on Linux
(as noted in the international known issues page.)  On Windows 2k/XP,
Devanagari, Thai, Tamil, Korean and other complex scripts supported by
Uniscribe are supported (although somewhat limited) if you install any
of complex script support packages (go to Control panel / International
or something like that) and reboot.  On Windows 9x/ME, only Tamil and
Korean are supported with 'special' fonts. Thai is supported only
on Thai version of Win 9x/ME.
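
For readers unfamiliar with the reordering expected in the question above, here is a deliberately simplified Python sketch. It is not Mozilla's CTL code and not a real shaping engine: it only moves DEVANAGARI VOWEL SIGN I (U+093F) in front of the single character before it, whereas real shaping moves it before the whole consonant cluster. The names are hypothetical.

```python
# Vowel signs that are stored after the consonant but drawn before it.
PRE_BASE_MATRAS = {"\u093f"}  # DEVANAGARI VOWEL SIGN I


def visual_order(logical: str) -> str:
    """Reorder pre-base vowel signs from logical (stored) to visual order."""
    out = []
    for ch in logical:
        if ch in PRE_BASE_MATRAS and out:
            out.insert(len(out) - 1, ch)  # place it before the previous char
        else:
            out.append(ch)
    return "".join(out)


logical = "\u092c\u093f"  # stored as b, then i
visual = visual_order(logical)
print([f"U+{ord(ch):04X}" for ch in visual])  # -> ['U+093F', 'U+092C']
```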

 If you want to make Mozilla support Devanagari on Linux, you have to
download the trunk source from CVS and build with 'enable-ctl'
and 'gtk' (for gtk2 + ctl, see mozilla bug 189433). If you like 'Xft'
(as many others do and I strongly recommend), turn on 'enable-xft'
as well. Then, install the SunIndic font (truetype version for 'Xft')
available at http://developer.sun.com/techtopics/global/index.html
(follow the link for free Indian font).

> (Funny setup, to be broken by default, but even the release page
> http://www.mozilla.org/releases/mozilla1.6b/known-issues-int.html
> mentions this. See also
> http://bugzilla.mozilla.org/show_bug.cgi?id=201746 .)

 Nothing funny. Complex script support is not that simple especially
when you have to retrofit it. I'd love to turn it on by default, but the
cursor movement issue has to be resolved before turning it on (see bug
203406 as well). And, eventually, we have to use Pango (see bug 215219).

> that source was so dirty - the produced binary failed with errors like
>  ./mozilla-bin: relocation error:
> mozilla/dist/bin/components/libeditor.so:
> undefined symbol: GetViewExternal__C8nsIFrameP14nsIPresContext

 In the mozilla binary directory, you have to run

 $ sh run-mozilla.sh ./mozilla-bin

By directly running 'mozilla-bin', you made it pick up
symbols from some other places (probably, system-wide nspr/xpcom/*
shared libraries installed on your system.)

 BTW, see also http://sila.mozdev.org

 Jungshik


Re: Unicode fonts on Debian

2003-12-20 Thread Jungshik Shin

On Sat, 20 Dec 2003, Edward H. Trager wrote:

> On Saturday 2003.12.20 15:06:11 +0100, Jan Willem Stumpel wrote:

> > > Actually, no. I think I already explained this.
> >
> > Yes, you did (on 15 December). Sorry. I stand corrected. So: the
> > default language group is determined by the UTF locale (which

   s/UTF// :-)

> > incidentally also determines Mozilla's GUI font). On Linux, the
> > default language group determines the fonts which Mozilla tries to
> > use (by preference) for displaying all Unicode characters. On

  Yes, unless there are other pieces of information that are more
relevant.

> > Windows, the preferred font is determined by the code range, which
> > seems more sensible, and in your bug report you propose to have
> > the same mechanism on Linux also.
>
> I second that: Regardless of what mechanisms are used, it would be very nice
> if Mozilla worked identically on Linux and on Windows.
 (moved below)
> Also, I assume that it would lead to some slight simplification of
> the Mozilla code base,

  Nobody would ever disagree with you. Do you seriously believe Mozilla
developers would make their tasks more difficult by not doing what you
wrote? However, the reality is not that simple. Note that on Linux/Unix
alone, we have a few different toolkits/font technologies to support that
are very different in their characteristics (XLFD vs fontconfig). Aside
from Linux, gecko-based browsers run not only on Win 9x/ME and Win2k/XP
(they're different OSes in many aspects) but also on several Unixes, OS/2,
Mac OS X, QNX, and VMS (and an unknown number of embedded devices). There
might (or might not) be a way to abstract away all these platform/toolkit
dependencies, but the current level of abstraction in Mozilla is not
there yet.  If we could use 'fontconfig' (+ pango or ICU) _everywhere_,
it'd be easy to do that. However, we'd not want to ask Mozilla
users on Windows or Mac OS X to install fontconfig + pango or ICU.
Including them in Mozilla is obviously out of the question because Mozilla
without them is already too 'fat'.

> That makes it much
> easier for developers who have to test whether web pages look the same on
> different platforms.

  Well, the platform-dependent font availability is another important
factor that makes the platform parity hard to achieve.

> > Probably not :-( , because when I try it on Win98 with Mozilla
> > 1.5, accessing a page with Путин  Ельцин,
> > Putin is in the Cyrillic preferred font, while Yeltsin is in the
> > Western font. Exactly the same as in Linux.

 There's another factor I didn't mention that affects when/whether
'Unicode char. to script' mapping kicks in. Mozilla-Win tries to stay in
the currently selected font as much as possible to avoid 'ransom note'
style (which looks horrible in some cases) rendering. Therefore, as long
as the current font can cover Cyrillic letters, I believe it wouldn't
switch.  However, I guess 'lang=ru, xml:lang=ru' is regarded as a strong
indication of the authorial intent that warrants the font switching.
(it's been a while since the last time I looked at that part of the code
so that I'm just writing from memory.)

  BTW, Mozilla doesn't do any 'global optimization' [1] in the
font selection as might be done by some word processors or other rendering
engines/libraries (e.g. Pango or ATSUI on Mac OS X). That is, its text
drawing/measuring routines can take only a small text chunk (sometimes
just a single character) at a time and don't know anything beyond that.

> > So I _still_ don't understand it (including your bug report).
> > Apologies in advance if I have overlooked something obvious..

  You don't have to apologize. It's complicated and the only
way to understand it fully is to read the code and work on it. Although
I worked on Windows and Gtk (Linux/Unix) ports of Mozilla's text
drawing/measuring routines for a while, I don't claim to know every
gory detail. What's certain is that Mozilla developers try to match
what's stipulated in the CSS specification (http://www.w3.org/TR/CSS2)
[2]. Whether they're successful or not is another matter, though.

 Jungshik

[1]
http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/PS/FontComposition.ps.gz
[2] See, for instance, http://bugzilla.mozilla.org/show_bug.cgi?id=227889

Re: Unicode fonts on Debian

2003-12-19 Thread Jungshik Shin

On Fri, 19 Dec 2003, Jan Willem Stumpel wrote:

> Jungshik Shin wrote:
>
> > It's impossible to infer the document encoding from 'lang' tag.
>
> Indeed, yes, I presented the URL inserted by jmaiorana to the W3C
> HTML validator and it could not make any sense out of it. Still,
> when I set Mozilla to 'autodetect Japanese' it correctly found it
> to be shift-jis. So it is possible "in a way"; after all, there
> are many text utilities (for Japanese only) that can guess (or
> autodetect) encodings.

  Sure, if you restrict the set of possible encodings to Shift_JIS,
ISO-2022-JP, and EUC-JP (the same is true of Korean encodings, SC
encodings, TC encodings, etc), it's usually possible to detect the
encoding correctly. Some commercial 'encoding detectors' (such as that of
BasisTech) reportedly do even better (a detection rate of 95% or higher).
Still, the detected encoding is just a hint if you want to find out the
language in which a document is written (this is the opposite of what
we've discussed) because in html/xhtml, any encoding can be used to
represent any characters. Of course, after guessing the encoding, one can
do some linguistic/statistical analysis to 'determine' the language.
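
The 'restricted candidate set' idea can be sketched in Python. This is
only a validity filter, not how Mozilla's or any commercial detector
works: the names plausible_encodings and JAPANESE_CANDIDATES are made up
for illustration, and a real detector would go on to use statistics to
choose among the encodings that survive.

```python
# Hypothetical shortlist: the three Japanese encodings named above,
# using Python's codec names for them.
JAPANESE_CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def plausible_encodings(data: bytes, candidates=JAPANESE_CANDIDATES):
    """Return the candidates under which `data` decodes without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid sequences
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


sample = "日本語のテキスト".encode("shift_jis")
print(plausible_encodings(sample))
```

Pure ASCII input survives under every candidate, which is one reason a
validity check alone cannot settle the question.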

> Aahh.. somethings now dawns on me: perhaps charset applies to the
> WHOLE document and must be determined before any processing is
> done, while lang can apply to individual sections? That is why
> Mozilla does not 'trust' lang for determining/autodetecting the
> encoding?

  Actually, you raised an interesting possibility. There's an
_HTTP_ header 'Content-Language'. Mozilla might be able to take
advantage of it.  It should be an optional feature, but with the
option on, Mozilla can turn to a charset detector corresponding
to the value of 'Content-Language'. Well, it wouldn't be very useful:
if an http server is configured (or a server-side script is written)
to emit the 'Content-Language' header, it's very likely that it also emits
the 'Content-Type' header with a 'charset' parameter, so that there'd be no
need for charset detection.  Another possibility is to make the
universal charset detector take into account the 'accept-language'
list (see Edit|Preferences|Navigator|Languages).

> It will (and can) autodetect, but only when told to do
> so by the user, not by the document. So probably jmaiorana (who
> said the page displayed correctly) had autodetect Japanese ON.

 Alternatively, the 'universal detector' may have been turned on and
it successfully detected the document as being in Shift_JIS.
Or the default charset was set to Shift_JIS, although that's not so likely
given that jmaiorana doesn't seem to be Japanese.

> > The value of 'lang' plays a role ONLY after the identity of
> > characters in documents are determined. See below.
>
> Right. Yes, this is quite clear to me now (finally!). The Mozilla
> algorithm is:
>
> 1. determine the encoding (for the whole document) from the
>    'charset' attribute, or by auto-detection as specified by the
>    user.

 There are several other hints/clues/factors that go in here, but
basically, you're right.

> 2. determine the font (for the section concerned, which may be the
>    whole "body") from the 'lang' attribute.

What's missing in your scenario is author-specified fonts. They're given
more weight than (and combined with) 'lang' if 'allow documents to use
other fonts' is checked.  I think I should file a bug to replace 'allow
... other fonts' with something clearer (e.g. 'honor author-specified
fonts' or 'ignore fonts specified by authors / in documents') because
it's confusing as demonstrated by Edward's confusion.

> If the attributes are missing, there are several fallback options
> and defaults,

> but this is the rule in principle. One default seems
> to be 'the language group is Western'. I can put two fragments of

  Actually, no. I think I already explained this. I'd rather not
repeat here. Instead, you can refer to my bug report at
http://bugzilla.mozilla.org/show_bug.cgi?id=208479. You can
also do the following experiment:

  $ env LC_ALL=ru_RU mozilla
  $ env LC_ALL=hi_IN mozilla
  $ env LC_ALL=ja_JP mozilla

> I must still do a few more experiments to find out what the rule
> is when no lang is specified but the UTF-8 character does not
> occur in the Western font. (and also what the rules are which are
> used by Xprint..)

  If you can decipher the code (I don't understand it fully) :-), you may want
to take a look at
http://lxr.mozilla.org/seamonkey/find?string=nsFontMetricsGTK.cpp
(especially, FindFont and LocateFont) for
the font selection mechanism 'shared' by G

Re: Unicode fonts on Debian

2003-12-19 Thread Jungshik Shin

On Fri, 19 Dec 2003, Eric Streit wrote:
> I have a "small" question ...
>
> The pages are perfectly rendered on the screen, but when it comes to
> printing, only one encoding is done and all the other glyphs are
> converted to "missing-caracters".

> Why not Mozilla ?

That's partly because Mozilla's printing on Unix has a lot of room for
improvement and partly because you didn't configure it properly. Well,
the latter is also partly due to the former (it should be easier and more
intuitive to configure). In my posting in this thread, I explained three
different printing 'modules' and gave some references. If you're interested
in printing Latin letters and Cyrillic letters, all three methods should
work, but Xprint and Freetype printing should give you better results
than the default PS module (which is always the case for any script).
How to use Xprint with Mozilla is well documented in
. As for freetype printing, you have to
edit either the global (system-wide) unix.js (found in
places like /usr/lib/mozilla-1.5/defaults/prefs/unix.js. From this,
you may guess where it's actually placed on your system) or
per-profile configuration file prefs.js in
$HOME/.mozilla///prefs.js (where
 is like 'k9xkxtyu.slt') to add the following:

pref("font.FreeType2.enable", true);
pref("font.FreeType2.printing", true); //on by default in mozilla.org builds.
pref("font.freetype2.shared-library", "libfreetype.so.6");
pref("font.directory.truetype.1", "/true/type/dir/1st");
pref("font.directory.truetype.2", "/true/type/dir/2nd");

pref("font.directory.truetype.n", "/true/type/dir/nth");

where '/true/type/dir/1st' and '.../nth' are directories with truetype
fonts.

If you edit the latter (per-profile user configuration), you have to use
'user_pref' in place of 'pref'. The per-profile file should be edited while
Mozilla is NOT running. Alternatively, you can edit the preferences by typing
'about:config' in the location bar. In the 'filter' box at the top of the
page, type 'freetype'; then select the pref. entry you want to edit and
right-click it to change its value. If you want to add a new
entry, you can choose 'New | Entry type' in the pop-up menu that comes up.
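
For those who would rather generate the numbered entries than type them,
here is a small Python sketch. The helper name truetype_dir_prefs is
hypothetical, it only produces text to paste into the file yourself, and
it assumes directory names without quotes or backslashes (nothing is
escaped).

```python
def truetype_dir_prefs(directories, per_profile=True):
    """Build pref lines for freetype and a list of truetype directories.

    per_profile=True emits 'user_pref' (for prefs.js); False emits 'pref'
    (for the system-wide unix.js).
    """
    keyword = "user_pref" if per_profile else "pref"
    lines = [f'{keyword}("font.FreeType2.enable", true);']
    # Directories are numbered 1, 2, ..., n in the order given.
    for n, directory in enumerate(directories, start=1):
        lines.append(f'{keyword}("font.directory.truetype.{n}", "{directory}");')
    return lines


for line in truetype_dir_prefs(["/true/type/dir/1st", "/true/type/dir/2nd"]):
    print(line)
```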

Hope this helps,

Jungshik

Re: Unicode fonts on Debian

2003-12-19 Thread Jungshik Shin

On Wed, 17 Dec 2003, Jan Willem Stumpel wrote:

> [EMAIL PROTECTED] wrote:
>
> > 
>
> That's a funny one, indeed. When I opened it in Mozilla it was
> displayed as åæäåæååå.For a moment I thought it
> was Chinese (which I do not know) but it is gibberish. Mozilla
> thought it was Chinese Simplified GB 18030. The source says  LANG="ja">. It is Japanese with shift-jis encoding, in reality it
> says ãåãåçãèã. (Isn't Unicode fun, allowing to put
> both variants in a mail message, just by copying from the Mozilla
> screen like this..)
>
> So, isn't the LANG attribute *more* irrelevant, because it did not
> help Mozilla (1.5a) to display the text correctly?

  It's impossible to infer the document encoding from the 'lang' attribute.
With NCRs, any document encoding can be used to represent any Unicode
characters. Even if that were not the case, how could you know whether it's
Shift_JIS, EUC-JP or ISO-2022-JP or EUC-JP (with JIS X 0213) _purely_
based on the value of 'lang' (suppose we don't have UTF-8, UTF-16, UTF-32,
for the sake of argument)?  The value of 'lang' plays a role ONLY after
the identity of characters in documents is determined. See below.

> A META tag
> attribute "charset=shift-jis" added to (a copy of) the page did.
> Doesn't that mean that "encoding" is more relevant than "language"?

 Internally, Mozilla works in terms of Unicode. That is,
it has to determine the document encoding correctly (to convert the
'byte stream' of the document to be rendered to a Unicode character
'stream') before doing any font selection.  If it mistakes Shift_JIS for
GB18030, what the character drawing routine receives doesn't make sense and
the 'langGroup' inferred from the document encoding is "in conflict with"
the language specified in the document (or a part thereof). (With NCRs
able to represent any Unicode characters, whether or not they're covered
by the current document encoding, this could happen all the time.) Which
one is given a higher priority? IIRC, it's the latter. So Mozilla tries to
render what it regards as 'a document in GB18030' (which is actually in
Shift_JIS) with Japanese fonts if possible.

BTW, as you know, GB18030 is another UTF  so that even without resorting
to NCRs (&#x(hh); or &#..;) it can cover the full range of Unicode.
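
To make the NCR point concrete, here is a short Python sketch. The helper
names to_ncr and from_ncr are invented for this example; html.unescape
stands in for an HTML parser and also resolves named entities, which is
more than plain NCR handling.

```python
import html


def to_ncr(text: str, charset: str) -> bytes:
    """Encode `text` in `charset`; unencodable characters become decimal NCRs."""
    return text.encode(charset, errors="xmlcharrefreplace")


def from_ncr(data: bytes, charset: str) -> str:
    """Decode `data` from `charset`, then resolve character references."""
    return html.unescape(data.decode(charset))


korean = "\uc11c"  # U+C11C, a Hangul syllable
print(to_ncr(korean, "ascii"))    # NCR form inside a pure-ASCII document
print(to_ncr(korean, "gb18030"))  # GB18030 encodes it directly
```

Either way the browser ends up with the same Unicode character, which is
why neither the encoding nor 'lang' can be inferred from the other.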

  Another BTW, which character encoding Mozilla comes up with for
unlabelled documents depends on your setting in
View | Character coding | Autodetect.  If it's set to Chinese,
it'll come up with one of the Chinese encodings for a Shift_JIS document.
Therefore, properly labelling html/xhtml/css documents is very important. (Try
the document in question with the html/xhtml validator at
http://validator.w3.org and see what it says.)

  Jungshik


Re: Unicode fonts on Debian

2003-12-16 Thread Jungshik Shin

On Tue, 16 Dec 2003, Edward H. Trager wrote:

> On Wednesday 2003.12.17 00:24:54 +0900, Jungshik Shin wrote:
> > Edward H. Trager wrote:

> > >In Edit|Preferences|Appearance|Fonts, Mozilla provides options for
> > >specifying fonts
> > >for various script encodings, so you should be able to fine tune exactly
> > >which fonts
> > >get used.
> >
> > Mozilla's font selection
> > menu is NOT per 'font encoding' BUT per 'langGroup' (which had better be
> > called
> > 'script group').  Only in Mozilla-X11core build,  the loose mapping between
> > 'font encodings' (XLFD-based) and 'langGroups' exists.
> >
>
> I wish I understood this better!
> What exactly does "langGroup" or "scriptGroup" mean in Mozilla?  Can you point me to

 'scriptGroup' is just a term coined by me that I believe is better than
'langGroup' because it's not languages but scripts that are relevant
here. 'langGroup's in Mozilla include 'Western', 'Central European',
'Japanese', 'Cyrillic', 'Arabic', 'Hebrew', 'Tamil', 'Devanagari',
and so forth (just what you see in the font-selection dialog).

> a URL that explains exactly how Mozilla does these things, and how that might
> be different from, say, the xft/fontconfig way of doing things?

  I tried to explain it in my long email you quoted in your previous
email apparently without reading it. Maybe not very clearly, but my
two emails (before your first email in this thread) answered most of
your questions.

> Clearly, from a user's perspective I was led to believe something
> possibly quite different about these dialogs in Mozilla.

  What did you believe was the case? Then, I'll go from there if
necessary.

  Jungshik

Re: Unicode fonts on Debian

2003-12-16 Thread Jungshik Shin

Edward H. Trager wrote:
> On Saturday 2003.12.13 15:23:30 +0100, Jan Willem Stumpel wrote:
>
> > Does anyone have a step-by-step description of how to install
> > Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni?
> > I am still puzzled on when exactly what font is used for display
> > and for printing in the various Mozilla versions. Each time I
> > think 'I got it' it turns out that 'I didn't get it'...
>
> I don't know whether the following page will answer your question or not:
>
> http://eyegene.ophthy.med.umich.edu/unicode/#fonts
>
> In Edit|Preferences|Appearance|Fonts, Mozilla provides options for
> specifying fonts for various script encodings, so you should be able
> to fine tune exactly which fonts get used.

 I wouldn't use 'fine-tune' and 'exactly'. As I wrote in my previous
messages, Mozilla's font selection algorithm is complex and Mozilla
contributors (including myself) have put a lot of time and effort into it,
but there are still issues. Besides, Mozilla's font selection menu is NOT
per 'font encoding' BUT per 'langGroup' (which had better be called
'script group').  Only in the Mozilla-X11core build does the loose mapping
between 'font encodings' (XLFD-based) and 'langGroups' exist.

> There is also a checkbox to "Allow documents to use other fonts" which I
> assume means that if the right glyph isn't found in the specified Unicode
> font, a glyph will get picked from whatever remaining installed font has
> that glyph.

 No, that doesn't mean that. That checkbox controls whether or not
author-specified fonts (via font-family in CSS and font-face in old style
html) should be given a higher priority than fonts configured in Mozilla's
font selection menu. If it's not checked, author-specified fonts are
ignored.

> I see this happen when I view Chinese pages with unusual characters in
> them.

Whether the above option is turned on or not, Mozilla does its best to
render every character. If it fails, it falls back to transliteration on
Windows and Linux (if the X11core build is used). In the case of
Mozilla-Xft, it draws a 4-digit (BMP) or 6-digit (non-BMP) hex number
inside a rectangle.
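
The 4-or-6 digit rule for the Xft fallback box amounts to the following
(a Python sketch; missing_glyph_label is a made-up name and this is not
Mozilla's code, just the rule as stated above):

```python
def missing_glyph_label(ch: str) -> str:
    """Return the hex digits shown for a character with no available glyph.

    BMP characters (up to U+FFFF) get 4 digits, all others get 6.
    """
    cp = ord(ch)
    return f"{cp:04X}" if cp <= 0xFFFF else f"{cp:06X}"


print(missing_glyph_label("\uc11c"))      # a BMP character
print(missing_glyph_label("\U00020000"))  # a non-BMP (plane 2) character
```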

Jungshik


Re: Unicode fonts on Debian

2003-12-16 Thread Jungshik Shin

Edward H. Trager wrote:
> On Sunday 2003.12.14 07:57:52 +0900, Jungshik Shin wrote:
>
> > On Sat, 13 Dec 2003, Jan Willem Stumpel wrote:
> >
> > > Does anyone have a step-by-step description of how to install
> > > Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni?
> >
> > Well, you're not supposed to install MS Arial Unicode on Linux at
> > least in some countries.
>
> Why not? If one has a valid license to an MS product containing MS Arial
> Unicode, then why couldn't one install it on both their Windoze and Linux
> installations?

  Well, read the EULA. Whether that is binding or not is another
question, which is why I added 'at least in some countries'.

> > If you want to install a Pan-Unicode font,
> > you'd better install James Kass' Code2000(BMP) and Code2001(non-BMP).
>
> With no offense to Mr. Kass' admirable efforts, but I think the Code 2000
> Hanzi/Kanji glyphs are particularly unsatisfactory in appearance -- and
> there definitely aren't enough of them

I recommended them solely based on the fact that there's no potential
'legal' issue with them.

 Jungshik

Re: Unicode fonts on Debian

2003-12-14 Thread Jungshik Shin

On Sun, 14 Dec 2003, Jan Willem Stumpel wrote:

> In the Mozilla font preferences you can set font preferences for
> Unicode, as well as for specific languages like Western, Japanese,
> etc. Am I then correct in assuming that the language-specific
> preferences always take priority over the Unicode preferences?
> Even when displaying a Web page which has "charset=utf-8"
> in the headers?

 Yes, it's confusing. I think we should get rid of the font
preference entry for Unicode because that's just confusing (there
is some use for it at the moment, though).  The font selection in
Mozilla is strongly influenced by 'langGroup' (had better be 'script'
or 'script group').  How is it determined? If there's an explicit
specification of the language with 'lang' in html and  'xml:lang' in
xml/xhtml in the document [1], it's honored. If not, it's inferred from
the document encoding. Obviously, this inference doesn't work at all
for utf-8. Currently, Mozilla uses the 'langGroup' corresponding to the
current locale for UTF-8 documents. That is, if you run Mozilla under
zh_TW.(UTF-8|big5|EUC-TW) locale, the langGroup of utf-8 document is
regarded as zh-TW. This doesn't work well and totally breaks down when
you have an iso-8859-1 (or any other non-Unicode encoding) document with
a lot of characters outside the repertoire of ISO-8859-1 represented
in NCRs. (see http://bugzilla.mozilla.org/show_bug.cgi?id=208479 and
http://bugzilla.mozilla.org/show_bug.cgi?id=91190). To work around
this problem, Mozilla on Windows maps Unicode code blocks to Mozilla's
'langGroups', which achieves what you asked below.
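
The order of precedence just described (explicit 'lang', then the
document encoding, then the locale for UTF-8) can be sketched in Python.
Everything here is illustrative: the function name, the tiny tables and
the group labels are assumptions, not Mozilla's actual data or code.

```python
# Illustrative tables only; Mozilla's real mappings are far larger.
LANG_TO_GROUP = {"ja": "Japanese", "ko": "Korean", "ru": "Cyrillic", "en": "Western"}
CHARSET_TO_GROUP = {
    "shift_jis": "Japanese",
    "euc-jp": "Japanese",
    "euc-kr": "Korean",
    "koi8-r": "Cyrillic",
    "iso-8859-1": "Western",
}
UNICODE_CHARSETS = {"utf-8", "utf-16", "utf-32"}


def resolve_lang_group(lang, charset, locale_group):
    """Pick a langGroup: explicit lang, else document charset, else locale."""
    if lang:
        primary = lang.lower().split("-")[0]
        if primary in LANG_TO_GROUP:
            return LANG_TO_GROUP[primary]
    cs = (charset or "").lower()
    # A Unicode encoding says nothing about the script, so skip it.
    if cs not in UNICODE_CHARSETS and cs in CHARSET_TO_GROUP:
        return CHARSET_TO_GROUP[cs]
    return locale_group


# An iso-8859-1 page full of Korean NCRs and no 'lang' is still 'Western'.
print(resolve_lang_group(None, "iso-8859-1", "Korean"))
```

The last line shows the failure mode of bugs 208479 and 91190: the
characters actually on the page never enter the decision.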

> In other words is there a mechanism (inside
> Mozilla) that says
>
> -  hmm... I have to display the character with number 49436 (hex
> C11C) here.
> -  this character is in the range of Korean syllables.
> -  now has a language-specific Korean font been specified? If so
> I'll use it.
> -  If not, I use the Unicode font (Bitstream Cyberbit, or
> whatever).

 As I wrote above, on Windows, Mozilla does more or less what you
wrote above. Mozilla-X11core and Mozilla-Xft have different font selection
mechanisms. Mozilla-Xft is strongly dependent on fontconfig, which
usually gives a lot better result than the font selection mechanism of
Mozilla-X11core, but that also makes it hard to fix bug 208479 mentioned
above.
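
For illustration, the 'code block to langGroup' approach you outlined
looks roughly like this in Python. The block ranges are the standard
Unicode ones, but the table, the group labels and the function names are
assumptions made for this sketch; it is not the Windows code. Han
ideographs are left out on purpose because the code point alone doesn't
tell Chinese from Japanese.

```python
# (first, last, langGroup) for a handful of Unicode blocks.
BLOCK_TO_GROUP = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x3040, 0x30FF, "Japanese"),  # Hiragana and Katakana
    (0xAC00, 0xD7AF, "Korean"),    # Hangul Syllables
]


def group_for(cp: int):
    """Return the langGroup whose block contains code point `cp`, or None."""
    for first, last, group in BLOCK_TO_GROUP:
        if first <= cp <= last:
            return group
    return None


def pick_font(cp: int, fonts_by_group: dict, unicode_font: str) -> str:
    """Use the font configured for the character's group, else the Unicode font."""
    return fonts_by_group.get(group_for(cp), unicode_font)


print(pick_font(0xC11C, {"Korean": "SomeKoreanFont"}, "SomeUnicodeFont"))
```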

> In other words, are huge "complete Unicode" fonts like Bitstream
> Cyberbit or Arialuni (which I promise not to try to use again..)
> only used for filling in the gaps where there are no
> language-specific fonts available? There does not seem to be much
> point in having them, then?

  You can also configure Mozilla to use those pan-Unicode fonts
(or fonts whose coverage is broad enough) for all langGroups you're
interested in.

> Another question: does Mozilla consider 'Latin Extended A'
> characters like ō (o with macron) to be 'Western'? Many Western

  As I explained above, Mozilla-Win does, but in Mozilla-X11core and
Mozilla-Xft, which character belongs to which langGroup is not a function
of Unicode code point (as it should be) but a function of the current
document encoding and the value of 'lang/xml:lang'.

> fonts (like Times New Roman) have them and display them fine.
> But for instance Bitstream Vera Serif does not have them, and some
> other font (I don't know which) is substituted. Which rules are
> used for this substitution? Does mozilla look for them in
> *another* Western font, or does it look in the 'Unicode' font?

  Mozilla's font selection mechanism is so complex that I can't
explain it in a few words (and it's also platform/toolkit dependent).
In Mozilla-Xft, fonts for 'Unicode' langGroup are mostly immaterial,
IIRC (I have to look up the code). Mozilla-Xft searches for a font
to render a character in the prioritized list of fonts returned
by fontconfig.  Therefore, what fontconfig returns in response to
Mozilla's query (that usually specifies 'lang' and 'font family name'
but NOT characters to render) determines which font is used to render
which character. Mozilla-X11core is a different story.  Using the 20-year
old XLFD makes it very hard to do things right (if you take a look at
nsFontMetricsGTK.cpp at http://lxr.mozilla.org, you'll see what I mean)
and I guess fonts specified for the 'unicode langGroup' are referred to at a
certain stage.

> > Mozilla's international release notes is your friend although
> > we didn't give gory details in the document. In Mozilla, goto
...
> Thanks very much for pointing this out. I had found out about the

  You're welcome :-)

> As regards to printing:
> I have (and have had for years) just 'lprng' and 'magicfilter' to
> print on my old Laserjet IIP. Also xprint works with that (as far
> as it works). Is there any point for me (or in general for users
> wanting a 100 % Unicode system) in switching to CUPS?

  I guess magicfilter should be fine especially considering that
you have a non-PS printer. CUPS is handy when you have a PS printer
that's no

Re: Unicode fonts on Debian

2003-12-13 Thread Jungshik Shin

On Sat, 13 Dec 2003, Jan Willem Stumpel wrote:

> Does anyone have a step-by-step description of how to install
> Bitstream Cyberbit in Debian Sid? And similarly for (MS) Arialuni?

Well, you're not supposed to install MS Arial Unicode on Linux at
least in some countries.  If you want to install a Pan-Unicode font,
you'd better install James Kass' Code2000(BMP) and Code2001(non-BMP).
They're available at http://home.att.net/~jameskass.  It'd be nice of you
to pay him $5. He's done a great service by making his fonts available
and deserves some monetary compensation, IMHO. You have to note that
for a good quality rendering, you'd better get fonts specifically
made for a subset of Unicode repertoire instead of pan-Unicode fonts.
Google 'alan wood unicode fonts' and you'll get Alan Wood's Unicode font
site. For Latin, you definitely need to install Bitstream Vera series
(donated by Bitstream). If you're also interested in Greek and Cyrillic,
a set of fonts made available by SIL (Gentium) are good to have.

> I am still puzzled on when exactly what font is used for display
> and for printing in the various Mozilla versions. Each time I
> think 'I got it' it turns out that 'I didn't get it'...

  Mozilla's international release notes are your friend although
we didn't give gory details in the document. In Mozilla, go to 'Help'
and 'Release Notes'. In the release notes web page, follow the link to
'international known issues'.  Basically, there are two different versions
of Mozilla for Linux and three different ways for printing.

  1. X11core font build(with gtk or gtk2 widget) :
 This is what's available by default
 at www.mozilla.org. It renders text using server-side
 X11core fonts, which can be bitmap (bdf), Speedo,
 type1, truetype, CID-keyed fonts, etc. However, all of them
 are 'presented' to clients (in this case, Mozilla) as
 a set of glyphs with a certain char. to glyph mapping
 and metrics expressed in XLFD.

  1'  The X11core font build also can take advantage of truetype
  fonts available on the client side if freetype is
  enabled (font.FreeType2.enable has to be set to 'true'
  in prefs.js). By default, it's enabled. You have to add
  directories with truetype fonts by editing prefs.js
  in your profile directory (usually,
  ~/.mozilla/${PROFILE_NAME}/${SALTED_NAME}/prefs.js).
  The preference entries for truetype fonts are
  "font.directory.truetype.1", "font.directory.truetype.2", and
  so forth (Mozilla takes a look at the directory explicitly
  specified and does not look inside subdirectories.)
  Alternatively, you can add them in 'about:config' (type
  'about:config' in the location bar). In addition, you
  have to specify the location of your freetype2 shared
  library.

  2. Xft-based build (with gtk or gtk2 widget). These builds
 take advantage of new client-side font libraries,
 Xft and fontconfig, that in turn rely on the freetype2 library.
 RedHat rpms available at ftp.mozilla.org are Xft + gtk2
 builds. I guess you can install one of them on debian
 with alien or similar tools. Usually, this build gives
 faster and better rendering results especially if you're
 interested in viewing non-Western European web pages.

Now for printing.

  1. Postscript printing module : this is the oldest. Some people
 regard this as totally broken and demanded that it be
 removed. Western European users may not have much trouble,
 but if you go beyond that, it begins to show its limitation.
 Even for Western European text, its PS output is far from
 'WYSIWYG'. That is, fonts used for screen rendering have
 nothing to do with fonts used in the print-out. It can be used
 with both builds listed above.

  2. PS + freetype2 : You have to enable both freetype (mentioned
 above) and freetype printing. This can be used with both kinds of
 builds. However, old rpms (Xft+gtk2 build) used to come with freetype
 disabled, but recent Xft+gtk2 at mozilla.org seem to have been built
 with freetype enabled.  This gives a reasonable (not very faithful)
 WYSIWYG. It's not faithful because the font selection mechanism is
 different for printing and screen rendering. Combined with
 CUPS and other modern Linux print servers, this works rather
 well.

  3. Xprint (http://xprint.mozdev.org). With this, Mozilla
 is an Xprint client (X11) to an Xprint server. You need
 to have an Xprint server running for Mozilla to talk to.
 The font selection mechanism is XLFD-based. Xprint (client-side)
 is enabled in X11core build at mozilla.org, but is disabled
 in Xft+gtk2 build.  Xprint server is available at
 http://xprint.mozdev.org

 More can be found at the aforementioned international known issues
page and links therein.

  Hope this helps,

  Jungshik


file system conversion tool

2003-12-05 Thread Jungshik Shin


Hi,

I thought some of you might be interested in 'convmv', a file system
encoding conversion utility I just came across. Most of you on this list
are likely to have switched over to UTF-8 and written a script or two for
the job.  Nonetheless, it may be handy to have tools like this nearby
so that you can help other 'skeptics' around you to 'convert' to UTF-8.

http://osx.freshmeat.net/releases/144059/

convmv converts filenames (not file content), directories, and even
whole filesystems to a different encoding. This comes in very handy if,
for example, one switches from an 8-bit locale to an UTF-8 locale. It
has some smart features: it automagically recognises if a file is
already UTF-8 encoded (thus partly converted filesystems can be fully
moved to UTF-8) and it also takes care of symlinks. Additionally, it is
able to convert from normalization form C (UTF-8 NFC) to NFD and
vice-versa. This is important for interoperability with Mac OS X, for
example, which uses NFD, while Linux and most other Unixes use NFC.
Though it's primarily written to convert from/to UTF-8 it can also be used
with almost any other charset encoding. Note that this is a command line
tool which requires at least Perl version 5.8.0.
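
For the curious, the two things the blurb mentions (re-encoding a
filename to UTF-8 and choosing NFC or NFD) look like this in Python. This
is not convmv's implementation (convmv is a Perl script), and
reencode_name is a hypothetical helper that works on a name in memory
rather than renaming anything on disk.

```python
import unicodedata


def reencode_name(raw: bytes, src: str, form: str = "NFC") -> bytes:
    """Decode a filename from `src`, normalize it, and return UTF-8 bytes.

    form is "NFC" (precomposed, usual on Linux) or "NFD" (decomposed,
    as used by Mac OS X).
    """
    text = raw.decode(src)
    return unicodedata.normalize(form, text).encode("utf-8")


latin1_name = b"caf\xe9"  # 'café' as stored under an ISO-8859-1 locale
print(reencode_name(latin1_name, "latin-1"))         # precomposed e-acute
print(reencode_name(latin1_name, "latin-1", "NFD"))  # 'e' + combining acute
```

The two results are different byte strings for what a user sees as the
same name, which is exactly the interoperability problem described.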


Jungshik

Re: Linux console internationalization

2003-08-14 Thread Jungshik Shin

On Wed, 6 Aug 2003, Edward H. Trager wrote:

> On Wednesday 2003.08.06 08:29:37 -0400, Chris Heath wrote:

> I (and many others ...) would argue that everyone needs to move to Unicode.

  So do I :-)

> it's going to support Unicode very well, and it is perhaps no longer going to
> support the 3-5 mutually incompatible legacy encodings of your language
> that you previously

  As Beni wrote, luit will help here.

> > * user-space pluggability for extra-heavyweight stuff like Japanese
> >input methods or fonts

> I wonder if the object oriented design of SCIM (Simple Common Input Method:
> http://ns.turbolinux.com.cn/~suzhe/scim/index.html) could support CJK and
> other IMs on the console?

  Or, IIIMF?

> > * bidi text (Arabic)
> > * variable width fonts (CJK),

 Perhaps, CJK 'bi-width' (or dual-width) fonts would be a better name.
Markus' simple wc(s)width(_cjk) can come in handy for this. Vim and
Xterm already use them to support the optional CJK width convention.

> > * variable-width encodings (Unicode combining chars),
>
> Yes, it would be nice if console worked as well as (or better than)

  There are a couple of frame-buffer based implementations (user-space)
around that support Indic scripts as well as Japanese, Korean and
perhaps Chinese (with built-in or external input methods).

> > How important is it to have an in-kernel console?

  Probably, offering a rather simple and robust in-kernel console along
with a full-featured i18nized heavy-weight  user-space console is the
way to go.

  Jungshik

Re: FYI: Some links about UTF-16

2003-07-13 Thread Jungshik Shin

On Fri, 11 Jul 2003, Wu Yongwei wrote:

> S***, it seems I made a mistake.  The font selection in Windows 2000 is not
> at all as flexible as Java; it's more like Linux.  Just that the default
> font in the Simplified Chinese version is still Tahoma instead of Song Ti.

 Thanks for checking that out. You saved me some tinkering :-)

> Jungshik must be right that I could change the default font in locale zh_CN
> to make ASCII characters appear nicer.

  With Gtk2 and fontconfig, I don't have to tinker with the  font
configuration as much as before because it looks all right to me.
As for CSS-style font list specification, the infrastructure is already
in place (fontconfig), but the 'UI' part needs some catch-up to do.
For instance, most GUI programs and window managers don't have UI to
let multiple (ordered-list of) fonts be specified (although it's
possible to do so by editing configuration files manually in _some_
cases.)

>  The only problem is that the
> standard locale for Simplified Chinese in Red Hat 8.0 (which I use) is
> zh_CN.GB18030.  I was told that it was possible to change that to
> zh_CN.UTF-8, but I did not find the motive/time to do that.

  It's rather easy. See
.

> Regarding the 'A' APIs in Windows.  Do you mean that there should be some
> API to change the interpretation of strings in 'A' APIs (esp. regarding file
> names, etc.)?  If that were the case, the OS must speak Unicode in some form
> internally.

  Yes, that's what I meant.  Beni already gave some details.

Beni>  win2k does have the option of
Beni> switching the encoding used in the 'A' APIs, it's just global and
Beni>  requires a reboot.

 Yup, I frequently do that to test Mozilla under different locales.
Having to reboot is really painful. On POSIX systems, we can just run
a program under any supported locale at the command line. Under Win2k/XP,
'chcp' works inside a 'command prompt' (even setlocale() works), but I
haven't checked whether there's a 'SetACP' (the opposite of 'GetACP').

> remount the partition in an appropriate encoding; if it is on an EXT2/3

  As you found out, there's a tool, or you can easily make one as many others
have done.  Once you switch to a UTF-8 locale, there's no need to look back.

> partition or on a CD-ROM, then I am out of luck.  Maybe the mount tool
> should do something to handle this? :-)

   In the case of a CD-ROM, it's not much of an issue. See the mount(8) man page
and the other man pages referred to there.

   Jungshik

P.S. A word of caution. A lot of _text-mode_ programs still assume that a single octet
takes a single screen 'cell', which holds for most legacy single byte and double
byte encodings. This assumption breaks down for UTF-8 and three byte sequences of
EUC-JP and four byte sequences of GB18030 (and eight byte sequences of EUC-KR).
Some of them are modified to cope with two-byte UTF-8 sequences (U+0100 - U+07FF),
but don't work with U+0800 and beyond. Needless to say, combining characters
are not handled in those programs.
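
To make the octet/cell mismatch above concrete, here is a minimal Python
sketch. The helper names (cell_width, octets_and_cells) are made up for
illustration, and the width rule (2 cells for East Asian Wide/Fullwidth
characters, 0 for combining marks, 1 otherwise) is only a rough
approximation of a wcwidth()-style function, not what any particular
text-mode program implements.

```python
import unicodedata

def cell_width(ch):
    """Rough number of terminal cells for one character (an assumption,
    loosely modelled on wcwidth(); real terminals have more rules)."""
    if unicodedata.combining(ch):
        return 0  # combining marks share the cell of their base character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian Wide / Fullwidth
    return 1

def octets_and_cells(text, encoding):
    """Return (number of octets in `encoding`, number of screen cells)."""
    return len(text.encode(encoding)), sum(cell_width(c) for c in text)

print(octets_and_cells("A", "utf-8"))          # (1, 1)
print(octets_and_cells("\ud55c", "euc-kr"))    # (2, 2): octets == cells
print(octets_and_cells("\ud55c", "utf-8"))     # (3, 2): the assumption breaks
print(octets_and_cells("e\u0301", "utf-8"))    # (3, 1): combining acute accent
```

A program that advances the cursor by one cell per octet gets the first
two cases right and the last two wrong.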


Re: FYI: Some links about UTF-16

2003-07-10 Thread Jungshik Shin

On Thu, 10 Jul 2003, Wu Yongwei wrote:

> Jungshik Shin wrote:
>
> > I think it's not so much due to defects in programs as due to the lack of
> > high-quality fonts. These days, most Linux distributions come with free
> > truetype fonts for zh, ja, ko, th and other Asian scripts. However,
> > the number and the quality of fonts for Linux desktop are still
> > inferior to those for Windows.
>
> The problem is mainly not font itself, but font combination.  I really
> cannot bear the display of ASCII characters in Song Ti, which is simply ugly
> (and fixed width).

  Why don't you specify a variable-width font as the system default?
I understand you still don't like Latin glyphs in Chinese fonts. I hate
Latin glyphs in Korean fonts, too.

> locale Linux seems to be able to do so, but in the Chinese locale all is in
> the Chinese font, which is not suitable at all for Latin characters.

  I don't think there's any difference between English and Chinese locales
provided that you meant en_*.UTF-8 and zh_*.UTF-8. You may get an impression
that it seems to work under en_US.UTF-8 because the 'system default font'
for en_US.UTF-8 does not cover Chinese characters and the automatic font
selection mechanism picks up a Chinese font for Chinese characters while
using the default font for Latin letters. On the other hand, in zh*.UTF-8,
the system default font covers Latin letters as well as Chinese characters
so that both Latin/Chinese are rendered with the default font.

  A workaround is to specify your favorite Latin font ahead
of your Chinese font if a CSS-style font list can be used.
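
The effect of the ordering can be sketched in a few lines of Python. The
font names and their coverage below are invented for the illustration (they
are not real fontconfig data); the only point is that the first font in
the list that covers a character wins.

```python
# Hypothetical fonts: one covers Latin only, the other Latin plus Han.
def is_latin(ch):
    return ord(ch) < 0x0250

def is_han(ch):
    return 0x4E00 <= ord(ch) <= 0x9FFF

FONT_COVERS = {
    "LatinSans": is_latin,
    "ChineseSong": lambda ch: is_latin(ch) or is_han(ch),
}

def fonts_used(text, font_list):
    """For each character, the first font in font_list that covers it."""
    result = []
    for ch in text:
        chosen = next((f for f in font_list if FONT_COVERS[f](ch)), None)
        result.append(chosen)
    return result

# Chinese font alone as the default: Latin letters come from it too.
print(fonts_used("A\u4e2d", ["ChineseSong"]))
# ['ChineseSong', 'ChineseSong']

# Latin font listed first: Latin letters now come from the Latin font.
print(fonts_used("A\u4e2d", ["LatinSans", "ChineseSong"]))
# ['LatinSans', 'ChineseSong']
```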

> Beginning with Windows 2000, Windows could choose the
> font to use based on the Unicode range (Java does this too).  In the English

  This is a good feature to have although a CSS-style font list works
most of the time.  Almost everything we need for this is already in
place (fontconfig, pango). BTW, I haven't seen this available in
Win2k. How can I do that? (not that I don't believe you but that
I'm curious)

> I used an Windows Gtk application, which used Tahoma (an good sans serif
> font) at first.  But after an upgrade it automatically chose to use the
> system default font, which is the Chinese Song Ti.  It took me several hours
> to "correct" the ugly and corrupt (yes, because dialogue dimensions are
> different) display.

  Again, I haven't run Gtk programs under Win32 so that I don't know how
they select fonts. Do they use fontconfig? fontconfig can make a big
difference.

> >> There seems little sense now arguing the virtues of UTF-8 and UTF-16.
> >> Technically they both have advantages and disadvantages.  I suppose we

> >   If MS had decided to use UTF-8 (instead of coming up with a whole new
> > set of APIs for UTF-16) with  'A' APIs, Mozilla developers' headache(and

> > UTF-8/'A' APIs vs UTF-16/'W' APIs and there are many other things to
> > consider in case of Win32.
>

> It seems impossible because there are some many legacy applications.  On the
> Simplified Chinese versions of Windows, 'A' always implies GB2312/GBK.
> Switching ALL to UTF-8 seems too radical an idea about 1994.  At the time

 Using 'A' APIs and UTF-8 does not mean that 'A' APIs are made to work ONLY
with UTF-8.  As you know well, 'A' APIs are basically the APIs that deal with
'char *'. As such, in theory, they can be used for any single or multibyte
encoding, including Windows 932, 936, 949, 950 and 6(I forgot the codepage
designation for UTF-8).

 As Unix(e.g. Solaris and AIX and to a lesser degree Linux) demonstrated,
a single application (written to support multibyte encodings) can work
well both under legacy-encoding-based locales and under UTF-8 locales.

> Microsoft adopted Unicode, people might truly believe UCS-2 is enough for
> most application, and Microsoft had not the file name compatibility burden
> in Unix

  Well, this is an orthogonal issue. POSIX
file system is so 'simple' (which is a virtue in some aspects) that it doesn't
have an inherent notion of 'codeset/encoding/charset'. However, Windows
doesn't use POSIX file system and  using 'A' APIs does NOT  mean that they
couldn't use VFAT or NTFS where filenames are in a form of  Unicode.

> (I suppose you all know that the long file names in Windows are in
> UTF-16).

  Actually, VFAT documentation is so hard to come by that we can just
speculate that it's UTF-16 (it could well be just UCS-2 in Windows 95)

> I would not blame Microsoft for this.

  I wouldn't either and I didn't mean to. I believe they weighed
all pros and cons of different options and decided to go with their
two-tiered API approach. In my p

Re: FYI: Some links about UTF-16

2003-07-08 Thread Jungshik Shin

On Wed, 9 Jul 2003, Wu Yongwei wrote:

> (excluding the desktop, which I prefer KDE).  But I did have some bad
> experience with Windows Gtk applications running on Chinese versions of
> Windows.  Not for functionality, but for UI.  You are right that they do
> care about Asian languages, but the problem seems that they may not have the
> hands to test on Asian language platforms.  At least not on Simplified
> Chinese Windows.  Not their fault, I must add.  Ah, I cannot bear setting

   I have no experience with Windows Gtk, but it could well be due
to the fact that Win32 APIs come in two flavors, 'A'(NSI) APIs and 'W'
APIs.  MS recommended a few different paths to support both pre-Unicode
("ANSI"-based) Windows (Win 9x/ME) and Unicode-based Windows
(Win2k/XP). One of them is to use 'MSLU' (Microsoft Layer for Unicode)
with pure 'W' APIs (not using 'A' APIs at all). Mozilla developers
once considered this approach, but gave it up because it led to a
dilemma. To make Mozilla run under Win 9x/ME, Mozilla developers would have
to tell Mozilla users to install MS IE 5.x or later (or MS Office or other
programs that have a license to bundle the MSLU dll with themselves).  Obviously,
it doesn't make much sense to ask users to install its competitor before
using it (needless to say, the reality is that virtually all MS Win users
have MS IE installed so that we don't have to worry...). There may be
other reasons I don't know of why the MSLU path was not taken.

What Mozilla ended up doing is to write our own wrappers and function
pointers for two dozen or so of Win32 APIs that get pointed to
either A APIs or W APIs according to the run-time detection of the
OS (Win9x/ME vs Win2k/XP). Mozilla's transition to this is not yet
complete  (see http://bugzilla.mozilla.org/show_bug.cgi?id=162361 and
http://www.mozilla.org/releases/mozilla1.4/known-issues-int.html)
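
The wrapper idea can be sketched as follows. This is NOT Mozilla's code and
it does not call any real Win32 function; set_title_a and set_title_w are
hypothetical stand-ins for one 'A' entry point and one 'W' entry point, and
the code page (cp949) is just an example. The sketch only shows the pattern
of binding a function pointer once, based on run-time detection.

```python
def set_title_a(title, codepage="cp949"):
    # 'A' flavour: the string travels as bytes in the system code page.
    return ("A", title.encode(codepage))

def set_title_w(title):
    # 'W' flavour: the string travels as little-endian UTF-16 code units.
    return ("W", title.encode("utf-16-le"))

def bind_api(os_is_unicode_based):
    """Choose the implementation once, like initialising a function pointer."""
    return set_title_w if os_is_unicode_based else set_title_a

# Callers use the bound name and never mention A or W again.
set_title = bind_api(os_is_unicode_based=True)    # e.g. Win2k/XP
flavour, payload = set_title("\ud55c")
print(flavour, len(payload))

set_title = bind_api(os_is_unicode_based=False)   # e.g. Win9x/ME
flavour, payload = set_title("\ud55c")
print(flavour, len(payload))
```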

  It's likely that Win32 Gtk is still dependent on 'A'NSI APIs. However,
this is pure speculation and could well be completely wrong.

> Linux locale to Chinese, which makes the desktop too ugly to me.  Rationale:
> The good intent of Open Source developers may not result in understanding
> the requirements of Asian users owing to lack of native
> developers/testers/users.

  That's a bit strange. My desktop under ko_KR.UTF-8 locale is not so bad.
  Anyway, it's not yet as pretty as that of Win32.

I think it's not so much due to defects in programs as due to the lack of
high-quality fonts. These days, most Linux distributions come with free
truetype fonts for zh, ja, ko, th and other Asian scripts. However,
the number and the quality of fonts for Linux desktop are still
inferior to those for Windows.

> There seems little sense now arguing the virtues of UTF-8 and UTF-16.
> Technically they both have advantages and disadvantages.  I suppose we have
> presented enough of them in this discussion.

  Let me just add my last comment...

  If MS had decided to use UTF-8 (instead of coming up with a whole new set of
APIs for UTF-16) with  'A' APIs, Mozilla developers' headache(and that of
other opensource developers) mentioned above would have been a lot easier
to cure :-) Of course, this is just one aspect of UTF-8/'A' APIs vs
UTF-16/'W' APIs and there are many other things to consider in case of Win32.

  Jungshik


Re: FYI: Some links about UTF-16

2003-07-08 Thread Jungshik Shin

On Tue, 8 Jul 2003, srintuar26 wrote:

> > Is it true that "Almost all modern software that supports Unicode,
> > especially software that supports it well, does so using 16-bit Unicode
> > internally: Windows and all Microsoft applications (Office etc.), Java,

> These decisions seem designed mostly to ease compatibility with
> Microsoft's OS.

  I agree. Or, for the lack of foresight...

> The Asian-language argument for UTF-16 seems
> mostly vacuous, and even if it were true it would be the lone

   Here again I agree. The worst case (text made entirely
of chars. between  U+0800 and U+) is 3:2.  With characters
below U+0800 (especially US-ASCII range) mixed up, the ratio is
even lower. For CJK Ext. B and C, UTF-8, UTF-16 and UTF-32 are all
even.
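
The ratios are easy to check with a small Python sketch (the function name
sizes is just a label chosen for this example; BOM-less codecs are used so
that only the payload is counted).

```python
def sizes(text):
    """Octet counts of `text` in the three UTFs, without a BOM."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Two Hangul syllables (U+0800 and above, still in the BMP): 3:2 worst case.
print(sizes("\ud55c\uae00"))   # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}

# US-ASCII only: UTF-8 is the smallest.
print(sizes("abc"))            # {'utf-8': 3, 'utf-16-le': 6, 'utf-32-le': 12}

# One CJK Ext. B character (U+20000): all three take four octets.
print(sizes("\U00020000"))     # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```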

  Jungshik


Re: FYI: Some links about UTF-16

2003-07-08 Thread Jungshik Shin

On Tue, 8 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:

> On Tue, 8 July 2003 05:22, Wu Yongwei wrote:
>
> > Is it true that "Almost all modern software that supports Unicode,
> > especially software that supports it well, does so using 16-bit Unicode
> > internally: Windows and all Microsoft applications (Office etc.), Java,
> > MacOS X and its applications, ECMAScript/JavaScript/JScript, Python,
> > Rosette, ICU, C#, XML DOM, KDE/Qt, Opera, Mozilla/NetScape,
> > OpenOffice/StarOffice, ... "?
>
> Do they support characters above U+ as fully as others? For Python I know

   Yes. At least, I know for sure that Mozilla, MS IE and MS Office XP
do.  That does not make me a fan of UTF-16.  You shouldn't assume
that others don't do what you're not happy to deal with.

The reason they use UTF-16 is NOT that it's inherently better
than other UTFs (UTF-8, UTF-32) BUT that they (not all) began
with UCS-2 and have a lot of baggage (written for UCS-2) to carry
on.  The prime example of this is the Win32 'W' APIs. The same is true of
Java, ECMAScript (the transition is not yet complete in case of
ECMAScript), and Mozilla.  (see
http://bugzilla.mozilla.org/show_bug.cgi?id=183156, for instance)

In case of applications written with UTF-8 as the internal string
representation (asked for in another posting), there are lots of
them. Basically, most gnome/gtk applications do because glib and
pango are based on UTF-8. Moreover, there's a programming language
whose internal char. representation is UTF-8 as is well known. It's
Perl. Besides, judging from the fact that Sun's iconv(3) implementation
uses UTF-8 as a hub (instead of UTF-32 as is the case of glibc's
iconv(3)), many programs in Solaris must be heavy users of UTF-8.

  Jungshik


Re: Strings in a programming language

2003-07-06 Thread Jungshik Shin

On Mon, 7 Jul 2003, Wu Yongwei wrote:

> > > I wonder, how many people really want to use Unicode codepoints
> beyond
> > > U+?
> >
> > I don't want to make it incorrect by design just because cases it
> doesn't
> > handle are rare.
>
> It's unnecessary to handle ALL cases.  You could address only issues
> encountered/expected by your end users.  IMHO, it is more important to
> make an application be light-weight and run in 99% cases.  Or, you may
> find your language used by, say, 1 people, and none uses the extra
> features that you spend 40% of your development labour.  And it is

  As you wrote, one can do what one believes. Anyway, correctly
handling non-BMP characters is not that difficult (40% of your
devel. time for a 1% constituency seems to me too big an exaggeration
:-) I know you're just making your case clear...).  Moreover, with
Math characters in plane 1 and MathML more widely used, it'd not be
so rare to find people who want to use non-BMP characters.
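
As a rough indication of how little is involved for a UTF-16 based
implementation, here is the surrogate-pair arithmetic in Python. The
function name is made up for the example; the arithmetic itself is the
standard UTF-16 encoding rule.

```python
def to_utf16_units(ch):
    """Return the UTF-16 code unit(s) for a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                    # BMP: one 16-bit unit
    cp -= 0x10000                      # 20 bits remain
    return [0xD800 + (cp >> 10),       # high (lead) surrogate
            0xDC00 + (cp & 0x3FF)]     # low (trail) surrogate

# U+1D400 MATHEMATICAL BOLD CAPITAL A, a plane 1 math character
print([hex(u) for u in to_utf16_units("\U0001D400")])   # ['0xd835', '0xdc00']
```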

  Jungshik


Re: Conversion of UTF-16 to UTF-8: For god sakes help

2003-06-06 Thread Jungshik Shin

On Fri, 6 Jun 2003, Edward H Trager wrote:
> On Fri, 6 Jun 2003, Bernhard Kaindl wrote:
> > On Fri, 6 Jun 2003 [EMAIL PROTECTED] wrote:
> > >
> > > > Is there a way?  Can you point me in the right direction?   I need to
> > > > convert just one-page to an 8-bit format so LYNX browsers can use it.

> > iconv --from-code UTF-16 --to-code UTF-8 inputfile >outputfile

  If you want to specify the endianness explicitly, you can append 'LE' or
'BE' to the encoding name (e.g. UTF-16LE).

> "uniconv", distributed with Yudit, should also do the job.
> http://eyegene.ophthy.med.umich.edu/unicode/#convutil

   So do Perl Encode module and native2ascii distributed with JDK. See
'man Encode' and 'man  encoding' if you have Perl 5.8 installed.
In case of native2ascii, you have to chain two together.

   native2ascii -encoding UTF-16  input | \
native2ascii -reverse -encoding UTF-8  > output

  See
http://java.sun.com/j2se/1.4.1/docs/tooldocs/solaris/native2ascii.html
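
If a scripting language is at hand, the same conversion takes a few lines.
This is only a sketch; the function name and the way the endianness is
passed are choices made for this example. Note that Python's plain
"utf-16" codec uses the BOM to detect the byte order and strips it, while
"utf-16-le" and "utf-16-be" do not look for a BOM.

```python
def utf16_to_utf8(data, endianness=""):
    """Convert UTF-16 bytes to UTF-8 bytes.

    endianness: "" (detect from the BOM), "LE" or "BE".
    """
    codec = {"": "utf-16", "LE": "utf-16-le", "BE": "utf-16-be"}[endianness]
    return data.decode(codec).encode("utf-8")

# With a BOM, no endianness needs to be given.
with_bom = "\ud55c\uae00".encode("utf-16")
print(utf16_to_utf8(with_bom) == "\ud55c\uae00".encode("utf-8"))    # True

# Without a BOM, say which byte order the input uses.
print(utf16_to_utf8(b"\xd5\x5c", "BE") == "\ud55c".encode("utf-8"))  # True
```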

Jungshik

Re: diacritic marks for Latin alphabet (Re: supporting XIM)

2003-04-02 Thread Jungshik Shin

Edward Cherlin wrote:

>On Monday 31 March 2003 10:05 pm, Jungshik Shin wrote:
>  
>
>>>>>Let's try some more.
>>>>>á̀ế̀̀î́̀ổ́̀̀û̀̀n̂́̀x̂̉́̀̀
>>>>>  
>>>>>
>
>I'm pleased that the accents are still there after four levels of 
>replies.
>

  That's because all three of us (Gaspar, you and I) do what we preach,
namely, using UTF-8 in our everyday computing :-)

>>>>>Not too bad, except that only the first three accents on
>>>>>each letter are actually displayed, and the dot on the i
>>>>>isn't removed.
>>>>>  
>>>>>
>>  Hmm, I can see only two diacritics in Kwrite with Code2000
>>
>>
>
>Yes, I get only two visible diacriticswith Code2000.
>
   I think Code2000 has some (maybe not so
comprehensive) OT layout tables for Latin letters. I'm copying
this to its author, James Kass.
  

>>font. I found that you appended as many as five of them to
>>each character in your sample.  What font did you use?
>>Nonetheless, it's a pleasant surprise that Kwrite does more
>>than simple overstriking.
>>
>>
>
>kwrite 4.0
>kde 3.0.3
>Arial Unicode MS (licensed copy) shows 3 diacritics
>  
>
Can you check your font with VOLT (www.microsoft.com/typography)
as to whether it has OT layout tables for Latin letters?  You need
to apply to join the OT developer group to get a copy.
It seems to be the only tool available for editing OT layout
tables.  I hope pfaedit will offer the feature soon.
 

>kmail 1.4.3
>Courier [Adobe]
>3 diacritics displayed
>  
>

Courier?  Hmm.  How about 'Courier' in kwrite?
So, are multiple diacritics stacked on top of each other, taking *disjoint*
spaces, instead of overlapping one another at the same spot?

  Anyway, now I'm wondering what Qt/KDE use for rendering.
Do they use Pango (that can't be, because Pango doesn't support OT layout
tables for Latin yet, although simple overstriking is supported) or do
they have their own complex script rendering library?

  Jungshik



a patch for vim to add 'cjkw' option for CJK users with CJK monospacefonts

2003-04-01 Thread Jungshik Shin

Hi,

Attached is my patch to add a 'cjkw(idth)' option that toggles CJK width
handling. When turned on, characters with the East Asian Width class
'A'(mbiguous) (see UAX #11, 'East Asian Width') are treated as
having a cell width of 2 instead of 1. The default is off (because,
typographically, the affected characters had better be treated as having
a cell width of 1) and the option is only effective when the fileencoding
is UTF-8.
This option is necessary because in the GUI mode (and in a terminal
where a CJK font is used or a similar option is turned on,
e.g. xterm with the 'cjk-width' option), many East Asian
people (CJK) use CJK fonts which have fullwidth (cell width of 2)
glyphs for characters with EA Width class 'A'. With
this patch and 'cjkw' turned on, there's no more inconsistency
between the width of the glyphs for characters like the Euro sign, the
registered sign and the copyright sign in those fonts and the width assumed
by vim. FYI, xterm has a similar option, 'cjk-width'.  Like xterm,
my patch uses Markus Kuhn's table of EA Width 'A' characters,
automatically generated from Unicode 3.2. When Unicode 4.0
is finalized, the table has to be updated.
It'd be nice if the patch can get in soon.
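
For readers who want to see what the option changes, here is a small
Python sketch of the width rule. It is not the patch itself (the patch is
C code inside vim and uses a generated table); char_cells and its
cjk_width parameter are names invented here, and the East Asian Width data
comes from whatever Unicode version the Python interpreter ships with.

```python
import unicodedata

def char_cells(ch, cjk_width=False):
    """Cell width of one character; cjk_width mimics the 'cjkw' toggle."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2          # Wide / Fullwidth: always two cells
    if eaw == "A" and cjk_width:
        return 2          # Ambiguous: two cells only when the option is on
    return 1

euro = "\u20ac"           # EURO SIGN, EA Width class 'A'
print(char_cells(euro))                   # 1
print(char_cells(euro, cjk_width=True))   # 2
print(char_cells("A", cjk_width=True))    # 1 (not ambiguous, unaffected)
```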

Jungshik


Re: opentype

2003-04-01 Thread Jungshik Shin

srintuar26 wrote:

> (For the sake of argument, if all precomposed glyphs were abolished,
> leaving NFC==NFD, then how would we store composition specializations
> inside fonts...)
 

 You have to distinguish between characters and glyphs here. The number
of Unicode characters representable with a font is different from the
number of glyphs in the font because, as you wrote, diacritic marks for
Latin/Greek/Cyrillic and other combining characters take different shapes
and different positions depending on where they're used. The same is true
of base characters. The shape of a base char. differs depending on whether
it's used alone or combined with combining characters, and on how many and
which combining characters it combines with.

In modern intelligent fonts like opentype fonts, char to glyph mapping
is not 1 to 1 but m to n, where m and n >= 1.
The way this m to n mapping is
stored in fonts and accessed by rendering/layout engines varies.
(There's even a proposal to add this intelligence to old X11 BDF.)
Opentype fonts have layout tables like GSUB and GPOS that have to be
accessed and activated by rendering engines like Uniscribe and Pango.
The amount of intelligence embedded in opentype fonts is smaller than
that in AAT (Apple's intelligent font format), in that with the former
Uniscribe and Pango have to do more work than is necessary with AAT fonts.
Graphite is another font format (? it uses the opentype format, but
its layout tables are different from GSUB/GPOS and so forth used by
Pango/Uniscribe) and rendering library pair.

For details, see http://www.microsoft.com/typography
                 http://developers.apple.com/fonts
                 http://www.pango.org
                 http://graphite.sil.org
                 and Adobe's page

Jungshik


Re: diacritic marks for Latin alphabet (Re: supporting XIM)

2003-04-01 Thread Jungshik Shin

Pablo Saratxaga wrote:

> The only latin-script based languages I know that use some accentuated
> letters not existing in precomposed form in unicode are Guarani
> (it uses "g with tilde") and Chechen (it uses several letters with
> a dot above, some exist in precomposed, but others don't).
> There may be others, but I only know about those two.
 

  I think orthographies of some African languages also need Latin
letters with diacritics for which Unicode/ISO 10646 has never assigned
and will never assign precomposed forms.
And, if we consider Old and Middle European languages, there are
even more.
Needless to say, IPA (although not a language) is a very 'fertile' source
of accented letters.
(I believe there are some IPA letters linguists want to use that are
not given separate codepoints.)

I didn't and wouldn't count math symbols here although there are a
lot of them with a Latin letter as the base char.
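
Whether a given base + diacritic pair has a precomposed form can be checked
by normalizing to NFC and counting code points. A minimal Python sketch
(the helper name is made up; the result depends on the Unicode data
shipped with the interpreter, but precomposed forms are no longer being
added, so it should be stable):

```python
import unicodedata

def nfc_codepoints(s):
    """Code points of `s` after NFC normalization, as U+XXXX strings."""
    return ["U+%04X" % ord(c) for c in unicodedata.normalize("NFC", s)]

# a + combining tilde composes to a single precomposed character.
print(nfc_codepoints("a\u0303"))   # ['U+00E3']

# Guarani g + combining tilde has no precomposed form and stays as two.
print(nfc_codepoints("g\u0303"))   # ['U+0067', 'U+0303']
```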

Jungshik




Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-04-01 Thread Jungshik Shin

Edward Cherlin wrote:

> On Monday 31 March 2003 10:40 pm, Jungshik Shin wrote:
>
> > Edward Cherlin wrote:
> >
> > > Have you looked at SILA? It uses SIL Graphite as the renderer
> > > for Mozilla.
> > > http://sila.mozdev.org/
> >
> > Yup. I'm aware of it.  At least for now it's only for Windows,
> > though. However, we may get some valuable insights from the
> > project that can be applicable to 'Mozilla-pango' marriage.
>
> I mean the part of the project that says they want to do a Linux
> port of Graphite, and thus of SILA, but not much is going on
> with it.

  A couple of issues: I guess OpenGraphite for Linux is not yet ready
for prime time while Pango is mature. SILA currently uses MS COM instead
of xpcom. To make SILA for Linux, MS COM needs to be replaced by xpcom.
We'll see which one gets there first, OpenGraphite or Pango.

Jungshik




Re: Mozilla Rendering (was Re: gtk2 + japanese; gnome2 and keyboardlayouts)

2003-04-01 Thread Jungshik Shin

Edward Cherlin wrote:

> On Tuesday 01 April 2003 08:02 am, Edward H Trager wrote:
>
> > Can Jungshik or someone else please clarify for me what
> > Mozilla 1.3 currently uses for complex script rendering? I'm
> > seeing differences in rendering of Thai on Linux (horrible)
> > vs. in Windows (OK) in Mozilla 1.3.
>
> Uniscribe on Windows. It supports Thai.

  Well, I guess even on Windows, Mozilla does not make use of Uniscribe
(at least it doesn't explicitly as far as I know) and intelligent
fonts with opentype layout tables.  Actually, I'm not sure. I asked
about this a couple of times, but got no answer.

> I don't know what it uses on Linux, but it uses something that
> doesn't support Thai properly,

  It sorta does if you compile it with the CTL (complex text layout)
feature turned on.
Mozilla source code includes a 'miniature version' of Pango for rendering
a couple of Indic scripts and Thai (contributed by Sun). However, that's
only for the 'plain gtk' build of Mozilla (not using Xft but old X11 core
fonts). A similar 'hack' (but not depending on Pango) should be possible
for the Xft build of Mozilla when bug 176290 is resolved
(http://bugzilla.mozilla.org/show_bug.cgi?id=176290).

> This is the point about building text rendering into the system.
> Applications cannot have their own rendering engines in general.
> So whatever the system renderer supports is the best you can
> expect in most software (if that).

  I fully agree with you. The problem with the current Mozilla is that
it seems rather hard to write a bridge to Pango (although I have a couple
of 'vague' ideas as to how to do it and I'm sure genuine gurus of Mozilla
have their own better ideas as well.)
Besides, I believe the Mozilla-Graphite 'marriage' should serve as a good
model for the Mozilla-Pango couple.

Jungshik

P.S. BTW, Thai can get rendered 'automagically' (well, not as well as
Thai people would expect) if you have fonts made for simple overstriking,
with zero/negative advance for combining characters.


Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-31 Thread Jungshik Shin

Edward Cherlin wrote:

> On Sunday 30 March 2003 11:25 pm, Jungshik Shin wrote:
>
> > I'm also gonna explore
> > if it's easier to wed 'pango' with Mozilla  if  gtk2  instead
> > of gtk is used. That would dramatically improve complex script
> > handling of Mozilla if possible.
>
> Have you looked at SILA? It uses SIL Graphite as the renderer for
> Mozilla.
>
> http://sila.mozdev.org/

Yup. I'm aware of it.  At least for now it's only for Windows, though.
However, we may get some valuable insights from the project that can be
applicable to 'Mozilla-pango' marriage.

Jungshik


diacritic marks for Latin alphabet (Re: supporting XIM)

2003-03-31 Thread Jungshik Shin

Edward Cherlin wrote:

>On Monday 31 March 2003 06:38 am, Gaspar Sinai wrote:
>  
>
>>On Sun, 30 Mar 2003, Edward Cherlin wrote:
>>
>>
>>>Let's try some more.
>>>á̀ế̀̀î́̀ổ́̀̀û̀̀n̂́̀x̂̉́̀̀
>>>Not too bad, except that only the first three accents on
>>>each letter are actually displayed, and the dot on the i
>>>isn't removed. 
>>>
  Hmm, I can see only two diacritics in Kwrite with Code2000 font.
I found that you appended as many as five of them to each character
in your sample.  What font did you use? Nonetheless, it's a pleasant
surprise that Kwrite does more than simple overstriking.

>>>
>>>What do you see in your mail?
>>>  
>>>
>>Yudit currently supports Mark-To-Base and Mark-To-Mark
>>(2.7.5.beta10) OpenType GPOS and it uses GSUB only for Indic
>>scripts, ligatures and shaping. Resonable Tibetan (almost
>>ready) also needs all of these complexities.
>>
>>If there is an urgent need for this in other scripts I can
>>take a look at it. 
>>
>>
>
>Not in Latin-alphabet text generally. Writing systems that have 
>such needs include Vietnamese, IPA, Math, Polytonic Greek, 
>  
>
  Does Vietnamese need diacritic marks? Sure, it does, but
I think all the letters it needs are encoded in precomposed form, so
they don't need any special treatment other than the conversion between
NFC and NFD.
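
As an illustration of that conversion, here is a short Python sketch using
a Vietnamese word whose vowel carries two marks. The function name is
arbitrary; the normalization itself is done by the standard unicodedata
module.

```python
import unicodedata

def nfd_then_nfc(word):
    """Decompose to NFD, then recompose to NFC; return both forms."""
    nfd = unicodedata.normalize("NFD", word)
    nfc = unicodedata.normalize("NFC", nfd)
    return nfd, nfc

word = "Vi\u1ec7t"            # 'Viet' with e + circumflex + dot below
nfd, nfc = nfd_then_nfc(word)
print(len(word), len(nfd), len(nfc))   # 4 6 4
print(nfc == word)                     # True: nothing is lost in the round trip
```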

> 
>Indic and South Asian are much higher priority than multiply 
>accented Latin for mathematicians.
>  
>
   That's why Indic scripts are rather well supported in Yudit now :-)

>>
>>Is it possible to define all the combinations in GPOS and GSUB
>>tables in the font at all?
>>
>>

It seems like this is where AAT fonts with state machine are superior to
opentype fonts.


fontconfig, alias/pseudo-fonts, Xft (was...Re: supporting XIM)

2003-03-31 Thread Jungshik Shin

Mike FABIAN wrote:

>Pablo Saratxaga <[EMAIL PROTECTED]> wrote:
>
>>Also, Xft allows to define "virtual fonts" created from a list of other
>>fonts; "Sans", "Serif" and "Monospace" come in standard.
>>
>~/.fonts.conf
>

I guess Pablo meant something like the following,
but this doesn't work the way he (and
I) wrote it would if only Xft APIs are used (see below). For instance,
'monospace' is a 'virtual' font defined as

  <alias>
    <family>monospace</family>
    <prefer>
      <family>Luxi Mono</family>
      <family>Nimbus Mono L</family>
      <family>Kochi Gothic</family>
      <family>ZYSong18030</family>
      <family>AR PL SungtiL GB</family>
      <family>AR PL Mingti2L Big5</family>
      <family>Gulimche</family>
      <family>Andale Mono</family>
      <family>Courier New</family>
    </prefer>
  </alias>

>>and define some pseudo-fonts you want.
>>

>How does that work? I didn't know that it is possible to define
>"virtual fonts" from a list of other fonts using fontconfig/Xft2.
>
>But I don't yet know a *simple* way to achieve that by using only Xft2.
>When using something like
>
>   xft_font = XftFontOpenPattern(dpy, pattern);
>

I guess you have to call fontconfig APIs (e.g. FcFontSort) directly
and manually break up your input text into multiple pieces,
each to be rendered by one of the fonts returned (by FcFontSort) depending
on their coverage. And, you know this *complex* way, don't you?
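
The manual break-up can be sketched like this in Python. Nothing here
calls fontconfig or Xft: the coverage predicates are invented stand-ins
for the charsets of the fonts that a call like FcFontSort would return,
and the font names are simply borrowed from the alias list for
readability.

```python
from itertools import groupby

# Made-up coverage; real code would consult each font's charset.
COVERAGE = {
    "Luxi Mono": lambda ch: ord(ch) < 0x0250,             # Latin (assumed)
    "Gulimche": lambda ch: 0xAC00 <= ord(ch) <= 0xD7A3,   # Hangul (assumed)
}

def first_covering_font(ch, sorted_fonts):
    """First font in the sorted list that has a glyph for ch, or None."""
    for name in sorted_fonts:
        if COVERAGE[name](ch):
            return name
    return None

def itemize(text, sorted_fonts):
    """Split text into (font, run) pieces, one font per consecutive run."""
    def key(ch):
        return first_covering_font(ch, sorted_fonts)
    return [(font, "".join(run)) for font, run in groupby(text, key=key)]

runs = itemize("abc\ud55c\uae00def", ["Luxi Mono", "Gulimche"])
for font, run in runs:
    print(font, repr(run))
```

Each piece would then be drawn with its own font, which is the part that
a single XftFontOpenPattern() call does not do for you.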

>I always got exactly one font. Are you saying that it is possible to use
>more than one font with a single call to XftFontOpenPattern()
>by doing some setup in ~/.fonts.conf?
>

I think Pablo mistook what fontconfig does for what Xft does unless
I'm missing something Pablo knows. I also plead guilty of making
a similar mistake when I wrote about working around a hard-coded
font name in a window manager and a theme (e.g. Courier).

Jungshik


Re: alias in fontconfig (Re: supporting XIM)

2003-03-31 Thread Jungshik Shin

On Mon, 31 Mar 2003, Edward Cherlin wrote:

> On Monday 31 March 2003 04:31 pm, Jungshik Shin wrote:
> >Tomohiro KUBOTA wrote:
> > >I want such "alias" to be automated.  If I have one Korean
> > > font installed, it is obvious that renderer must use the
> > > font for all Korean texts. It is not a good idea that the
> > > renderer fail to display Korean when the user doesn't
> > > configure the "alias".
> >
> > fontconfig always returns a font if there's a font on the
> > system with the character requested.
> > So, it's possible now.
>
> Doing it one character at a time is guaranteed to give hideous
> results. I have had the unfortunate experience of viewing a
> display in mixed CJK fonts, and I have had many similar

   Well, it depends on what kinds of fonts you have on your
system and the way you specify the fonts you want to use. I'm well aware
of 'ransom note'-like results when you mix up fonts of many *different*
styles and design principles in a single run of text.  This problem can
be minimized if you are careful in putting together fonts of similar
styles and design principles.

   Anyway, if someone finds it difficult to edit the fonts.conf
file and doesn't want to install a minimal set of well-populated
fonts  (sans, serif, monospace, etc), but still wants
as many characters as possible to be rendered, a ransom note
is what she deserves to get.

> unfortunate experiences of viewing APL code rendered in random
> math fonts. It is extremely important to a lot of people that
> they be able to specify a font *per language*, without regard to

  Well, *per-language* is not a cure-all although
on many occasions, it's sufficient.

> the definition of Unicode blocks or old-time code pages or
> ISO-8859-* or any other 8-bit font hack. But we want to do it

  We don't live in that world any more largely thanks to
fontconfig, Xft and Pango.  The age of X11 corefonts
and XLFD hack has gone for good.

> There is, of course, the question of defining the character
> repertoire and rendering rules for a language (which may differ
> substantially from the rules for another language written in the
> same script). To get started, it will suffice if I can say that
> the set of characters in one font that I designate defines the
> repertoire for my use of the language. When we have adequate
> support for more intelligent fonts, we can build in some of the
> rendering rules, also, but in the end language-specific document
> creation will be the job of applications well above the text

   In case of html, 'lang' does the job and Mozilla supports
it pretty well. Unfortunately, 'xml:lang' is not yet supported.

> editor level. At some point, explicit repertoire lists will be
> needed, I suppose. Or something else we haven't thought of yet.

   Care to take a look at http://fontconfig.org ?
It includes a lang-dependent repertoire list for most, if not all,
of the languages listed in ISO 639 (or is it ISO 30xx?).

   Jungshik


Re: supporting XIM

2003-03-31 Thread Jungshik Shin

Jungshik Shin wrote:

Edward Cherlin wrote:

The starting point of this discussion was the inability to use 
Chinese, Korean, and Japanese IMEs in the same locale. I write 
documents in all three languages, and I would do it more often if it 
were actually convenient.

 This is becoming rather frustrating. How many times do I have to write
that it IS possible right now to install all of them and switch
between them in a *single* application (session) running under any
UTF-8 locale of your choice?   Why don't you try installing


 I'm sorry, I somehow didn't realize (how couldn't I? I don't know...)
that you probably wrote the above because I had written that everything
you need for CJK input comes by default with modern Linux distros, which
is not true, and that you don't need a HOWTO.  Certainly, it's not well
known that it's possible to switch between multiple gtk2 input modules
(including those for CJK), and it'd be nice to have a well-written
summary of the issue with pointers to various gtk2 input modules. It
also would be nice for major Linux distributions to include them.

Jungshik


Re: alias in fontconfig (Re: supporting XIM)

2003-03-31 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

- Xmms cannot display non-8bit languages (music titles and so on).

 

  Are you sure? It CAN display Chinese/Japanese/ Korean id3 v1  tag 
as long as
the codeset of  the current locale is the codeset used in ID3 v1 tag.  
   

I'll test this further.  However, please note I won't be satisfied by
"i18n" which require specific configuration other than setting LANG
variable (and installing required softwares and resources).
 xmms does NOT take anything more than setting LANG. The reason I used
LC_ALL in my example is that it's the only sure way to set the locale.
If I use LANG, it can get shadowed by LC_ALL and LC_*; LC_ALL overrides
both LC_* and LANG. Other complications are not the fault of xmms but
that of the ID3 v1 tag, which does not have any mechanism for specifying
the encoding.  ID3 v2 should solve this problem by using Unicode, but
not many programs support it. (I doubt many portable mp3 players
support it.)
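
The precedence described above (LC_ALL first, then the category's own
LC_* variable, then LANG) can be sketched roughly as follows. This is
only an illustration in Python, not what xmms or libc actually run; the
function name effective_locale and the fallback to "C" when nothing is
set are assumptions made for the example.

```python
def effective_locale(env, category="LC_CTYPE"):
    """Return the locale name a POSIX-style lookup would pick for `category`.

    Precedence, highest first: LC_ALL, then the category's own LC_*
    variable, then LANG. Empty values are treated as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # assumed default when nothing is set


if __name__ == "__main__":
    # LANG alone is enough, but LC_ALL shadows it when both are present.
    print(effective_locale({"LANG": "ja_JP.UTF-8"}))
    print(effective_locale({"LANG": "ja_JP.UTF-8", "LC_ALL": "ko_KR.EUC-KR"}))
```

This is why the earlier example used LC_ALL: whatever else is set in the
environment, it wins.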

I want such "alias" to be automated.  If I have one Korean font installed,
it is obvious that renderer must use the font for all Korean texts.
It is not a good idea that the renderer fail to display Korean when
the user doesn't configure the "alias".
   fontconfig always returns a font if there's a font on the system
with the character requested. So, it's possible now.

 

- There are no lightweight web browser like dillo which is i18n-ed.
 

I think that w3m-m17n is an excellent lightweight browser that 
supports I18N well.
   

Well, I meant a lightweight GUI browser.  Though I haven't checked,

 

  It's sort of a GUI browser. It supports image rendering and the
mouse.  You can also compile it with a JS interpreter. BTW, how about
Phoenix (www.mozilla.org/projects/phoenix) and Galeon?

There is another i18n extension of w3m: "w3mmee".  I don't know which
is better.
 

 I'm aware of that. I just wish either of them (or a combination of
the two) were included in w3m.

- FreeType mode of XFree86 Xterm doesn't support doublewidth characters.

 

  Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect
he'll apply it sooner or later. After that, I'll add '-faw' option 
(similar to '-fw' option).
   

Fantastic!  May I want more?  Xterm can automatically search a good
(corresponding) doublewidth font in non-FreeType mode.  How about
your patch?
 

I'm not sure whether I can.  We'll see.




Re: supporting XIM

2003-03-30 Thread Jungshik Shin

Edward Cherlin wrote:

>On Sunday 30 March 2003 06:29 pm, Jungshik Shin wrote:
>  
>
>>Edward Cherlin wrote:
>>
>>
>>>On Sunday 30 March 2003 03:26 am, Jungshik Shin wrote:
>>>
>>>  
>>>
>>>I
>>>can't test some of the others myself, and haven't heard any
>>>detailed information on them. I have not found any problems
>>>with diacritics in Latin and Cyrillic.
>>>  
>>>
>>  Well, you do have problems with characters with diacritics
>>in Latin,Greek and Cyrillic for which
>>Unicode does NOT have assigned and will NEVER assign separate
>>codepoints. That's
>>what I was talking about. There are  tens  , if not hundreds,
>>
>>
>
>thousands, if not tens of thousands. I'm a mathematician.
>  
>

  I know how to multiply, too. It doesn't take a mathematician
to multiply, does it?  :-) The reason I wrote tens/ hundreds
instead of thousands/tens of thousands was that I like to
give the number of combinations that have turned up
in existing documents rather than the number of
all possible combinations.

>  
>
>>of combinations
>>(base character + one or more diacritic mark(s)) that can ONLY
>>be represented by combining character sequences. 
>>
>>
>
>Like this? 
>à̀
>It's an a with two accents, and it composes and displays 
>correctly in kwrite and kmail, with one accent above the other.
>
>Let's try some more.
>á̀ế̀̀î́̀ổ́̀̀û̀̀n̂́̀x̂̉́̀̀
>Not too bad, except that only the first three accents on each 
>letter are actually displayed, and the dot on the i isn't 
>removed. Curiously, Yudit doesn't handle multiple accents as 
>well as these simple-minded apps do.
>

 Yudit needs the same change as I proposed for Pango in this mail
and a couple of others. Yudit supports OpenType layout tables
for several Indic scripts and it needs to do the same for the
Latin/Greek/Cyrillic alphabets. SIL has one such font.
Unfortunately, the last time I downloaded it, there was something
wrong with the zip file and I couldn't try it.
(http://www.sil.org/~gaultney/gentium/index.html)

>
>What do you see in your mail?
>  
>
  I can't tell without knowing what I'm supposed to see.
Anyway, what I see is two diacritics overlapped over
each other instead of taking disjoint 'spaces' alongside
or on top of /below each other.  See 
http://www.columbia.edu/kermit/st-erkenwald.html
for a real life example.

  Didn't I specifically write that Pango does not support
diacritic marks combined with base characters while Uniscribe
does (although it didn't until very recently)? I know
that xterm and vim support up to two combining characters
and that's how pre-1933 Korean script and Latin/Greek/Cyrillic
diacritic marks are supported by xterm/vim. I guess kmail/kwrite
do likewise. However, that's a kind of last resort when you
don't have a better way to do it properly.  Eventually, what
we need is support in Pango and that's filed as
bug 101079 (see http://bugzilla.gnome.org/show_bug.cgi?id=101079)

Other pango bugs I filed (excluding Korean-specific ones)
include :

http://bugzilla.gnome.org/show_bug.cgi?id=101081
http://bugzilla.gnome.org/show_bug.cgi?id=106624

>The starting point of this discussion was the inability to use 
>Chinese, Korean, and Japanese IMEs in the same locale. I write 
>documents in all three languages, and I would do it more often 
>if it were actually convenient.
>

  This is becoming rather frustrating. How many times do I have to write
that it IS possible right now to install all of them and switch
between them in a *single* application (session) running under any
UTF-8 locale of your choice?   Why don't you try installing
all three of them (im-ja, imhangul and wenju ) and fire up
gedit and right-click on the text input area to see what you have?
The very same information was given last December, and
this thread doesn't add any new information except for
im-ja in place of other less advanced Japanese gtk2 input modules.

  Jungshik


Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-30 Thread Jungshik Shin

srintuar26 wrote:

As far as input methods are concerned, this thread is almost a replica
of the thread last December, and all this information was given then
(except for the KDE/Gnome2 Xkb kbd switcher and im-ja, in place of which
a less advanced gtk2 input module for Japanese was mentioned by Owen).
Is there anything wrong with the collective memory of this list? ;-)
   

Well I for one have been placated for now by im-ja. Its precisely
what ive been looking for, and extensive googling didnt root it out.
 im-ja may not have turned up in Google, but the archive of this list
includes all the necessary information we went over again last week,
except for the KDE/Gnome2 kbd switcher. Actually, I'm not sure
of my own memory and that may also have been mentioned in
the past.



XIM has been a disappointment for me, and I got tired of using iconv,
rom2hira scripts, a trivial console based canna interface, and
kanjipad for my input needs. (rh8 uses euc-jp for its Japanese
locale, and I refuse to use non-utf-8 locales, but XIM wont work
correctly or stably outside of the euc-jp locale...)
 

Well, you must not have been on this list long enough. Last Nov/December,
I posted how to make RH8 support ja_JP.UTF-8 and ko_KR.UTF-8.
Most of my changes have been fed back to XFree86 and are included
in XF86 4.3. Hopefully, RedHat 9.0 will turn on UTF-8 locales for
CJK by default, as I urged them to on several occasions.
BTW, I've been using ko_KR.UTF-8 for about a year now.
Now if only more apps were gtk2 based...
Mozilla and gvim come to mind.
 

 The gtk2 patch for vim works very well. Just search for 'vim gtk2
patch' and you'll get http://regexxer.sourceforge.net/vim.  If you're
adventurous, you can try building the gtk2 port of Mozilla yourself.
It's being worked on. I'm gonna give it a shot myself soonish.  I'm also
gonna explore whether it's easier to wed 'pango' with Mozilla if gtk2
instead of gtk is used. That would dramatically improve complex script
handling in Mozilla if possible.
  Jungshik


Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-30 Thread Jungshik Shin

Evan Martin wrote:

(Following the earlier discussion about XIM...)

http://im-ja.sourceforge.net/
is a pretty effective input module for Japanese input in GTK2.
 

And, you can install *along* its side,

 http://sourceforge.net/projects/wenju/  (includes gtk2 input module(s) 
for Chinese : table-based)
 http://kldp.net/projects/imhangul   : Korean gtk2 input module suite

and other gtk2 input modules for other scripts. You can also switch
among the various Xkb-supported key layouts, as you and others wrote,
with the help of the KDE keyboard switcher or the Gnome2 keyboard
switcher. Besides, if you want, you can still use any XIM server you
like. I'd rather use the built-in XIM server (Compose for UTF-8 locale)
by resetting the XMODIFIERS env. variable (or equivalents in
Xresources).

As far as input methods are concerned, this thread is almost a replica
of the thread last December, and all this information was given then
(except for the KDE/Gnome2 Xkb kbd switcher and im-ja, in place of which
a less advanced gtk2 input module for Japanese was mentioned by Owen).
Is there anything wrong with the collective memory of this list? ;-)

Jungshik




Re: I18nized apps (was Re: supporting XIM)

2003-03-30 Thread Jungshik Shin

Edward Cherlin wrote:

On Sunday 30 March 2003 06:42 pm, Jungshik Shin wrote:
 

Edward Cherlin wrote:
   

Nadine Kano wrote one, published by Microsoft, which is
unfortunately very much out of date and out of print. I know
of
 

Well,  the book is not just outdated but has some critical
errors/mistakes and Microsoft-centrism(that doesn't work well
for POSIX system) along with useful information. BTW, I
believe MS press released an update to
the book recently.
   

Pointer?

 Google is your best friend. Just typing 'developing international
software' brought me right here:

 http://www.microsoft.com/mspress/books/5717.asp

However, I'm afraid the usefulness of this book is limited for
developers working on POSIX systems because they need to understand and
use UTF-8, while on WinNT/2k/XP they don't have to worry about variable
length encodings other than surrogate pairs.  Another reason I have
reservations about this book is that the second edition is very likely
to retain the mistakes/errors of the first edition (about some multibyte
encodings and character set names) that I wrote about, although thanks
to the widespread use of Unicode, their relevance is smaller now.




Perhaps some of us should get together and pitch the idea to
O'Reilly. Certainly a HOWTO is in order.
 

 Although it's not exactly the kind you're looking for, CJKV
Information Processing
would be a useful reference for I18N engineers.
   

That and The Unicode Standard and TRs are our best resources. We 
need someone to write "Indic Information Processing", "Arabic 
Information Processing" (for all of the languages written in the 
Arabic alphabet), and maybe a few others.
 

  Yeah, that would be nice.

  Jungshik


Re: supporting XIM

2003-03-30 Thread Jungshik Shin

srintuar26 wrote:

Chinese and Japanese (not Korean) don't use
  whitespace between "words".
   

Ooh, that makes me curious: is there a good discussion of how to
line-break Japanese text? I wonder how browsers are doing it...
 

 As far as line breaking is concerned, it's not hard to do it right for
Japanese text. All browsers need to do is NOT break where line breaking
is prohibited, as specified in JIS X 14xxx(?)[1], and break at other
places (syllable boundaries, character boundaries[2]) to make the text
as justified (on both sides) as possible. The same is true of Korean and
Chinese. It doesn't make any difference whether spaces are used or not
in Japanese/Korean/Chinese. Mozilla (and I guess MS IE as well) supports
JIS X 14xxx for Japanese, Korean and Chinese.[3]  A harder case than
this is Thai text, and that's where you need to pay more attention. The
Thai line breaking rule is also supported by Mozilla.

As I wrote earlier, programs like 'fmt' should support this.

Netscape 3.x broke lines ONLY at spaces, so some Korean web page
authors used a simple perl script to insert a tag everywhere (at every
syllable boundary) line breaking is allowed.

[1] The prohibition rule is not rocket science. You can easily guess
it. Here are some examples:
 - lines cannot be broken after an opening quotation mark, single
   or double. That is, a line cannot end with one.
 - lines cannot be broken before a comma, a period, a question mark or
   an exclamation mark. That is, a line cannot begin with one.
 - There are some Kana-specific rules I don't remember at the moment.

[2]  To generalize, I'd use 'grapheme boundaries'. See Unicode TR #29
for details.
[3] See also Unicode TR #14.
When you read UTR #14, be aware that its treatment of Korean line
breaking is not satisfactory. Simply put, Korean text can be broken at
any *grapheme boundary* (when NFC is used for modern text, that means at
any Unicode code point boundary for modern syllables) as well as at
spaces, except for about a dozen places where line breaking is
prohibited (see the aforementioned JIS X 14xxx). 99% of Korean text in
print, formal or informal, uses a layout justified on both sides, but
TR #14 gives the *wrong* impression that about half of Korean text
breaks lines only at spaces and uses a ragged style.  The author of
TR #14 wouldn't listen to my feedback, insisting that he had plenty of
printed materials contradicting what I had told him, although he
acknowledged my feedback at the end of TR #14.
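
The prohibition rules in footnote [1] can be sketched in a few lines.
This is a simplified illustration, not the JIS or UTR #14 algorithm: the
two character sets are a small hand-picked sample, the names wrap and
can_break are made up for the example, and width is counted in
characters rather than terminal columns.

```python
# Characters a line must not end with (opening quotes and brackets).
NO_BREAK_AFTER = set("“‘「『（([")
# Characters a line must not begin with (closing punctuation).
NO_BREAK_BEFORE = set("、。，．,.?!？！」』）)]”’")


def can_break(text, i):
    """True if a line may end just before text[i]."""
    return text[i - 1] not in NO_BREAK_AFTER and text[i] not in NO_BREAK_BEFORE


def wrap(text, width):
    """Greedy wrap: break at any character boundary that is not prohibited."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Back up until we find a boundary where breaking is allowed.
        while end > start + 1 and not can_break(text, end):
            end -= 1
        if not can_break(text, end):
            end = start + width  # no legal break on this line: force one
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


if __name__ == "__main__":
    for line in wrap("あい、うえお。", 2):
        print(line)
```

Note that no whitespace is needed anywhere: every character boundary is
a candidate, and the rules only remove a few of them.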

Re: supporting XIM

2003-03-30 Thread Jungshik Shin

Glenn Maynard wrote:

programmers in X care more about X support than Windows
support (which is very annoying to Windows users, who often end up with
 

old, buggy ports of X software when they get them at all).

off-topic: This is one of many reasons the scientific community
(astronomy/astrophysics for instance) was one of the earliest groups
that quickly embraced Linux. Their main toolsets are all written for
X11 and their Windows/MacOS ports were buggy and outdated, but porting
them to Linux is a lot easier.

This is actually one advantage of NFD: it makes combining support much
more important.  (At least, it's an advantage from this perspective;
those who would have to implement combining who wouldn't otherwise
probably wouldn't see it that way.)
 

  Another advantage of NFD is consistency.  In NFC, some characters
with diacritic marks are represented as precomposed while others are
represented with base character + diacritics. In NFD, all characters
are represented the same way except for some Korean Hangul Jamos, due to
'the' very stupid mistake of the South Korean standards body that
requested the removal of the decomposition of cluster Jamos into
sequences of simple/basic Jamos. (Overall, Korean script handling in
Unicode/10646 is among the worst.)
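
A quick way to see the inconsistency is to normalize a few strings with
Python's standard unicodedata module. The sample characters are my own
choice for illustration: e + acute has a precomposed form, x +
circumflex does not, and U+1101 is a cluster Jamo with no canonical
decomposition.

```python
import unicodedata


def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


samples = {
    "e + acute": "e\u0301",
    "x + circumflex": "x\u0302",
    "cluster Jamo SSANGKIYEOK": "\u1101",
}

if __name__ == "__main__":
    for label, s in samples.items():
        nfc = unicodedata.normalize("NFC", s)
        nfd = unicodedata.normalize("NFD", s)
        print(f"{label}: NFC {code_points(nfc)} | NFD {code_points(nfd)}")
```

In NFC the first sample collapses to one code point while the second
stays as base + mark; in NFD both are base + mark. The Jamo sample stays
a single code point in both forms, which is the exception mentioned
above.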

By the way, I just gave lv a try: apt-get installed it, used it on a
UTF-8 textfile containing Japanese, and I'm seeing garbage.  It looks
like it's stripping off the high bits of each byte and printing it as
ASCII.  I had to play around with switches to get it to display; apparently
it ignores the locale.   Very poor.  Less, on the other hand, displays
it without having to play games.  It has some problems with double-width
characters, unfortunately.
 

  Actually, with Owen Taylor's patch posted here about a year and a
half ago(?), 'less' works pretty well in UTF-8 under a UTF-8 xterm.

Jungshik Shin


Re: Pango tutorial? (Re: supporting XIM)

2003-03-30 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

Unfortunately, there are no tutorials for Pango.  A developer of "Xplanet"
and I sent mails to a Pango developers (Evan Martin and Noah Levitt) to
ask that but they think Pango is not intended to be used from applications
 

   Owen Taylor is 'the' Pango developer, isn't he?




Re: I18nized apps (was Re: supporting XIM)

2003-03-30 Thread Jungshik Shin

Edward Cherlin wrote:

Nadine Kano wrote one, published by Microsoft, which is 
unfortunately very much out of date and out of print. I know of 

Well, the book is not just outdated but has some critical
errors/mistakes and Microsoft-centrism (that doesn't work well for POSIX
systems) along with useful information. BTW, I believe MS Press released
an update to the book recently.

Perhaps some of us should get together and pitch the idea to 
O'Reilly. Certainly a HOWTO is in order.
 

 Although it's not exactly the kind you're looking for, CJKV 
Information Processing
would be a useful reference for I18N engineers.


Re: supporting XIM

2003-03-30 Thread Jungshik Shin

Edward Cherlin wrote:

On Sunday 30 March 2003 03:26 am, Jungshik Shin wrote:
 

The wish list for modern writing systems is mainly made up of 
systems with complex rendering.

Some of Indic (but some is already done)
Sinhalese
Burmese
Cambodian
Laotian
Tibetan
Mongolian
Thaana and Ethiopic are not difficult, 

 Given the way Ethiopic is encoded in Unicode (as a 'syllabary' instead
of an 'alphabet'), I don't think Ethiopic counts as a complex script.
It could have been one if the encoding model used for Ethiopic were like
that of the Indic scripts.

but need somebody who 
wants to work on them. Cherokee, CAS, and some others fall into 
the same category.

Mandrake Linux provides keyboard support for Cyrillic, Greek, 
Israeli Hebrew, Armenian, Georgian, Bengali, Devanagari, 
Gujarati, Gurmukhi, Tamil, Thai, Laotian, and Burmese, but not 
Arabic. There is a lack of rendering for Burmese, but I have not 
had problems typing Sanskrit. Not all of the conjuncts exist in 
the fonts available, but that is not the fault of the apps.

  On the other hand, none of these scripts is supported in
text-terminal-based programs, and it's not even clear what to do in
that situation.
 

I 
can't test some of the others myself, and haven't heard any 
detailed information on them. I have not found any problems with 
diacritics in Latin and Cyrillic.

 Well, you do have problems with characters with diacritics in Latin,
Greek and Cyrillic for which Unicode does NOT have assigned and will
NEVER assign separate codepoints. That's what I was talking about. There
are tens, if not hundreds, of combinations (base character + one or more
diacritic mark(s)) that can ONLY be represented by combining character
sequences. They're necessary when you deal with Old and Middle English,
for instance. Pango does not yet have support for those cases (Latin,
Greek and Cyrillic).  However, Pango is not far behind because it was
not long ago that MS added support for Latin/Greek/Cyrillic combining
characters to Uniscribe.
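
Whether a given combination falls into this category is easy to check:
normalize it to NFC and see whether any combining marks are left over.
A small sketch with Python's standard unicodedata module follows; the
helper name residual_marks and the sample strings are only illustrative.

```python
import unicodedata


def residual_marks(s):
    """Count combining marks that survive NFC normalization.

    A non-zero result means the text has no fully precomposed form and
    the renderer has to position the remaining marks itself.
    """
    composed = unicodedata.normalize("NFC", s)
    return sum(1 for c in composed if unicodedata.combining(c))


if __name__ == "__main__":
    print(residual_marks("e\u0301"))        # e + acute
    print(residual_marks("a\u0300\u0300"))  # a + two grave accents
    print(residual_marks("x\u0302"))        # x + circumflex
```

The first case composes to a single precomposed letter; the other two
still carry a combining mark after NFC, so a rendering engine without
mark positioning will show them badly.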


out a way to funnel IME input through the normal character
input calls, we might well achieve CJK support in the
majority of apps.
 

 Well right now, the majority of programs in modern Linux
distros DO  work well with CJK IMEs. In case of gtk2
applications, they also work well with any gtk2 input modules
including those for CJK.  Of course, this doesn't mean that
there's very little to  do when it comes to CJ(K) support, but
I don't share Kubota-san's concern.
 


I have a Chinese HOWTO, but I can't find a Japanese or Korean 
HOWTO. Any pointers? I can type Chinese with Cangjie, Korean 
Hangul, and Japanese with romaji conversion in software where I 
know how to activate them. I would be delighted if I could do it 
in e-mail.
   

  You don't even need HOWTO documents these days because modern Linux
distros come with virtually everything you need for CJK support. (I'm
assuming here that you have a pretty good command of the CJK languages,
which must be the case judging from what you have been writing on this
list.) I thought you had been on this list for a while and had heard of
most of the things you need for CJK. For Korean, you can use either
'Ami' (http://kldp.net/projects/ami), when you launch your program under
the ko_KR.EUC-KR or ko_KR.UTF-8 locale, or imhangul
(http://kldp.net/projects/imhangul) for gtk2 applications. Hopefully,
Pablo on this list picked up 'imhangul', which I mentioned several times
on the list, and included it in Mandrake 9.1. Even if not, you can just
install the rpm available at the site above. As for Ami, SuSE, RH,
Mandrake and others have had it for a couple of years. The same is true
of Japanese IMEs and back-end servers like Canna. For gtk2 applications,
you may try im-ja (http://im-ja.sourceforge.net).

Jungshik




Re: supporting XIM

2003-03-30 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

Perhaps not double-width, but there are plenty of non-ASCII,
non-ISO-8859-1 characters in the Unicode set that should be
interesting to U.S. programmers.
   

This is a good information.  I hope such people will hard-code
UTF-8 support up to two bytes.  Though I didn't find such softwares,
I heard there are such softwares.  We have to continue keeping watch
on i18n implement of softwares
How about "em-dash" or ligatures such as "fi" or "ffl"?  Are they
doublewidth?
 

Em-dash is a valid example, but 'fi/ffl' are NOT. Ligatures should not
be 'hardcoded' by those who edit documents, but have to be automatically
'summoned' at the rendering layer. Anyway, other examples include the
Euro sign, genuine opening quotation marks and many more that have been
mentioned several times by Markus Kuhn on this list before.
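
Incidentally, these examples also show why UTF-8 support hard-coded "up
to two bytes" would not be enough even for such users. A quick check in
Python (the helper name utf8_len is just for illustration):

```python
def utf8_len(ch):
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "e with acute": "\u00e9",
    "em dash": "\u2014",
    "euro sign": "\u20ac",
    "left double quotation mark": "\u201c",
}

if __name__ == "__main__":
    for name, ch in samples.items():
        print(f"{name}: U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")
```

Accented Latin letters fit in two bytes, but the em dash, the Euro sign
and the typographic quotation marks all need three.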




Re: supporting XIM

2003-03-30 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

- a word processor whose menus and messages are translated into your
  native language but cannot input/display text in your native language
- a word processor whose menus and messages are in English but can
  input/display/print text in your native language
Which is better?  The first one is completely unusable and the second
one is unconveinent but usable.
 

I agree with you on this point. That's why I compared the status of KDE
in 1999-2000 with that in 2003. Back in 1999-2000, KDE/Qt people thought
that translating messages was I18N, but they no longer do, and KDE/Qt
supports 'genuine I18N' much better now.

Now brief list of examples.

- Xmms cannot display non-8bit languages (music titles and so on).

  Are you sure? It CAN display Chinese/Japanese/Korean id3 v1 tags as
long as the codeset of the current locale is the codeset used in the
ID3 v1 tag. The problem with mp3 and the id3 v1 tag is that the tag
doesn't have any means of labelling the codeset used in it. Therefore,
you can't view Russian id3 v1 tags (in KOI8-R) and Korean id3 v1 tags
(in EUC-KR) in a *single* xmms session.  To work around this, there are
three ways (we discussed this issue a couple of months ago on this
list):

 1. Convert all id3 v1 tags in your mp3 collection to UTF-8
 2. Give up the idea and launch two separate xmms under two
    different locales
 % LC_ALL=ru_RU  xmms &
 % LC_ALL=ko_KR xmms &

- Xft/Xft2-based softwares cannot display Japanese and Korean at the
  same time while Xft and Xft2 are UTF-8-based, because there are no
  fonts which contain both of Japanese and Korean.  This should not
  be regarded as a font-side problem, because (1) font-style principle
  is different among scripts (there are no "courier" font for Japanese)
You can use 'alias' in fontconfig if some programs use 'Courier' or
'Arial' instead of generic font names like 'monospace', 'serif',
'sans-serif', and so forth.

  and (2) such fonts need developers who can design letters all over
  the world.  Pango's approach (changing font according to script)
  is needed.  

 Well,  if Xft2 is used along with fontconfig, there's no such problem. 




- There are many window managers which support "themes".  Even if the
  window manager itself is already i18n-ed, some themes cannot display
  non-Latin-1 languages.  This occurs in two cases: (1) when the theme
  specifies a font name (it is very likely) or (2) when the theme
  supplies an origial font.
 In the first case, you can work around the problem rather easily with
the 'alias' mechanism in fontconfig.

 

- There are no lightweight web browser like dillo which is i18n-ed.

I think that w3m-m17n is an excellent lightweight browser that 
supports I18N well.

- FreeType mode of XFree86 Xterm doesn't support doublewidth characters.

  Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect
he'll apply it sooner or later. After that, I'll add '-faw' option 
(similar to '-fw' option).
  

- Ghostscript.  It is known that it can handle Japanese by some
  trick (by localized version?) but it is too complex and difficult
  for me.
 It's not that hard. Most changes made by the gs-cjk project have been
folded back into the upstream gs.  Moreover, modern Linux distros now
come with ghostscript with all the 'hard' jobs (configurations) already
done for you and you don't have much to do.

- Even OpenOffice.org 1.0 cannot display Japanese even with Japanese
  add-on package.  I have to configure some font substitution.  Note
  that this can be done only after installation, thus I cannot read
  (translated) messages during installation at all.
 OpenOffice seems to have a serious problem when run under a UTF-8
locale. Under locales with legacy codesets, it more or less works, but
the Unix/X11 version appears to need an overhaul with a new
client-side font framework (fontconfig, Xft, pango). Its use of the old
server-side font technology makes it slow and ugly.



- Curses-basd softwares.  They must not assume number of bytes is
  same as number of columns or number of characters.  Doublewidth
  and combining character support is needed.
  As I mentioned already, this is where we need a lot of work. There
are a few programs that work well, though, when linked against
ncursesw.  One prominent example is mutt.

 

- Perl doesn't have wcwidth().

  Well, there are a couple of Perl packages that let you query various
Unicode character properties, so it should be trivial to write your
own wcwidth() if somebody hasn't done it already.
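
To give an idea of how little is needed once the character properties
are available, here is a rough sketch, written in Python with its
standard unicodedata module rather than in Perl. It is a simplification
and not a replacement for a real wcwidth(): control characters are not
treated specially, and the names char_width and string_width are made
up for the example.

```python
import unicodedata


def char_width(ch):
    """Approximate number of terminal columns for one character."""
    # Combining marks and format characters take no column of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(s):
    """Sum of the column widths of all characters in `s`."""
    return sum(char_width(c) for c in s)


if __name__ == "__main__":
    text = "日本語"
    print(len(text.encode("utf-8")), "bytes")
    print(len(text), "characters")
    print(string_width(text), "columns")
```

The three numbers printed at the end differ, which is the point made
about curses-based software above: bytes, characters and columns are
three separate counts.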

- Text line wrapping.  Chinese and Japanese (not Korean) don't use
  whitespace between "words".
 

 I already mentioned this issue. Programs like 'fmt' have to be
modified, but there's already an alternative to 'fmt' that supports
the Unicode line breaking algorithm.

I feel that CJK people everytime have to keep a watch on softwares
which are already i18n-ed, because i18n support of such softw

Re: supporting XIM

2003-03-30 Thread Jungshik Shin

On Sat, 29 Mar 2003, Edward Cherlin wrote:

> aplications explicitly at present, and automatic support for
> Cyrillic, Greek, Armenian, or Hindi doesn't help Japanese users
> much.

   Automatic support for Hindi? Hmm, do I live in a world
different from yours?  It's NOT CJ(K) BUT Hindi, Tibetan, Arabic, Hebrew,
Bengali, pre-1933 Korean, Polytonic Greek (and Latin/Cyrillic with diacritic
marks for which combining characters are necessary) and other complex
scripts that have the largest wish list. Pango has support for some
Indic scripts and the Thai script, but it doesn't yet support layout of
Greek/Cyrillic/Latin with OpenType layout tables.

> out a way to funnel IME input through the normal character input
> calls, we might well achieve CJK support in the majority of
> apps.

  Well right now, the majority of programs in modern Linux
distros DO  work well with CJK IMEs. In case of gtk2 applications,
they also work well with any gtk2 input modules including
those for CJK.  Of course, this doesn't mean that there's
very little to  do when it comes to CJ(K) support, but
I don't share Kubota-san's concern.

  Jungshik


Re: supporting XIM

2003-03-29 Thread Jungshik Shin

On Sat, 29 Mar 2003, Pablo Saratxaga wrote:

> On Sun, Mar 30, 2003 at 12:37:49AM +0900, Tomohiro KUBOTA wrote:
>
> > However, I am often annoyed by people who think supporting European
> > languages is more important than supporting Asian languages

  I don't think you meant it that way, but I find it very annoying
that some people and software use 'Asia' to mean only CJK.
One prominent example is Sun's StarOffice and OpenOffice.
That's almost an insult to the people of the Indian subcontinent,
Southeast Asia, Central Asia, and Southwest Asia.

> Are there such people?

  There might be some, but as I wrote in my response to Kubota-san,
the I18N mindset is much more widespread than 5 years ago and
I agree with your assessment of I18N in Linux below.

> Note also that, currently, I do'nt agree with you that i18n of programs
> is low; to the contrary, the majority of programs have good to
> very good i18n support.

> > How should I call such people?  I know they are never "racists" in its
> > original meaning.
>
> "ethno-centrist" is the word you are looking for I suppose.

  If they're from Western Europe, 'Western-Eurocentric' :-)

> Tell me about one single current major program/project that doesn't have
> i18n support (maybe there are, and I'm just not aware of it (probably because
> a modern software without i18n support is not worth it in my eyes).

  One example is mkisofs in cdrtools. It's 'single-byte-centric'
and the project maintainer has yet to accept a patch for multibyte support
(including UTF-8). Sooner or later, I'll send him a new patch in such
a form that he finds it hard to leave it aside.

  Other examples include fmt, and other textutils, mc (it sorta works,
but needs a lot of work to be fully I18Nized and UTF-8 friendly), lynx
(one MIME charset at a time is well supported, but it needs multilingual
ability as found in w3m-m17n. I hope major linux distros include w3m-m17n
instead of plain w3m) and Pine (it works fine for a single MIME charset,
but not yet multilingual and screen handling is single-byte centric. My
UTF-8 patch solves only a small subset of these problems). 'less'
still needs more work (Owen's patch is better than my patch
that went into less 37x.)

   Some terminal emulators and terminal-based/-like programs need
to pay more attention to East Asian Width (UTR #11). xterm has an
option '-cjk_width' and other programs need a similar option/feature.
Vim needs this. Its current column width calculation routine is not based
on wcwidth(). (I plan to fix this soon.  It's very easy, and Markus's
wcwidth and wcwidth_cjk come in very handy. It's better to use them than
wcwidth from glibc, which is locale-dependent.) The gtk2 font selection
widget should optionally offer a way to designate a *separate*
'monospace' font for 'double width'. So should Qt's font selection widget.
It's naive to believe that fontconfig and pango can do the magic for
this case, as evidenced by the fact that MS Word under MS Windows,
even with equivalents of fontconfig and pango, lets
users select an East Asian font separately.
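
   To make the width calculation concrete, here is a minimal Python
sketch. It is not Markus's wcwidth code and not what Vim or xterm
actually do; it only uses the East Asian Width property exposed by the
standard unicodedata module. The names char_width and column_width and
the cjk flag are made up for this illustration, and the handling of
combining characters is deliberately simplified (only characters with a
non-zero combining class count as zero-width; control characters are
not handled at all).

```python
import unicodedata

def char_width(ch, cjk=False):
    """Return the number of terminal columns one character is assumed to take."""
    if unicodedata.combining(ch):
        return 0                      # combining mark: shares the previous cell
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2                      # Wide and Fullwidth characters
    if eaw == "A" and cjk:
        return 2                      # Ambiguous: two columns in legacy CJK mode
    return 1

def column_width(text, cjk=False):
    """Sum the assumed column widths of all characters in text."""
    return sum(char_width(ch, cjk) for ch in text)
```

With this sketch, column_width('한글') gives 4, while a Greek letter
such as 'α' counts as 1 column by default and 2 with cjk=True, which is
the kind of difference a CJK-width option is meant to cover.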

   Full-screen text based programs need to be linked against
ncursesw rather than ncurses or slang (how good is slang's
UTF-8 and multibyte support?) and delegate as many screen-manipulating
tasks to ncursesw as possible. When used with mutt, ncursesw
appears to work well under UTF-8 locale.

   Jungshik

Re: supporting XIM

2003-03-29 Thread Jungshik Shin

Tomohiro KUBOTA wrote:

Hi,

From: Jungshik Shin <[EMAIL PROTECTED]>
Subject: Re: supporting XIM
Date: Thu, 27 Mar 2003 18:38:51 -0500 (EST)

 That's not a problem at all because there are Korean, Japanese
and Chinese input modules that can coexist with other input
modules and be switched to and from each other. With them, you
don't need to use XIM.

...

One point: Many Japanese texts include Alphabets, so Japanese people
want to input not only Hiragana, Katakana, Kanji, and Numerics but
also Alphabets.  I imagine Korean people want, too.  In such a case,
switching between Alphabet (no conversion mode) and conversion mode
has to be achieved by simple key typing like Shift + Space.  

 There are two kinds of switching involved here. One is intra-module
mode/level switching and the other is inter-module switching.
What you want for Japanese (and, as you correctly guessed, Koreans need
it too) can be easily achieved by the intra-module mode switching of a
single gtk2 input module. For instance, all 5 modules included in the
imhangul Korean gtk2 input module suite interpret 'shift-space' as the
toggle switch between Korean and English input modes and 'F9' as
Hangul-to-Hanja conversion. I don't see any reason the same cannot be
done for Japanese gtk2 input modules.  I believe there's nothing in the
gtk2 input module framework that prevents a single input module from
supporting multiple 'modes' (or levels) that can be switched around if
necessary.

As for inter-module switching, I guess some more work is necessary.
It seems like the only way to switch to another input module is through
the pop-up menu that can be 'summoned' by right-clicking. However,
combined with the KDE keyboard switcher (I got to know that gnome2
has a similar utility), which appears to be a simple wrapper over
setxkbmap, you don't have to right-click very often, I believe.

Another point: I want to purge all non-internationalized softwares.
Today, internationalization (such as Japanese character support) is
regarded as a special "feature".  

However, I think that non-supporting
of internationalization should be regarded as a bug which is as severe

 I agree and think most, if not all, people on this list agree, too.
Thanks to a lot of smart people from all over the world, including a lot
of contributors like you from Japan, the free/open source community has
taken several, if not a lot more, huge steps forward in terms of I18N
during the last few years. Back in 1998, when I read Drepper's paper
on I18N in glibc, the problem appeared to be overwhelming. As late
as 1999/2000, the KDE team mixed up L10N and I18N and claimed that
KDE 1 supported CJK while all it actually had was translated messages
in CJK.  Now look what we have. gtk2/gnome 2/pango, KDE3/qt, glibc2,
XFree86, Xft/fontconfig, freetype, _NET_WM extension, ICU,  Perl 5.8,
xterm/mlterm, vim, yudit,  Omega/Lambda, many others I forgot to mention

means users have freedom to choose.  Such a freedom of choice must not
be a privilege of English-speaking (or European-languages-speaking)
people.  Do you have any idea to solve this problem?

No question about that. What do we have to do? Well, just as we have
done so far,  I think we have to keep working as well and as hard as we
can.  I think I18N-awareness and an I18N mindset are now widespread
among developers worldwide and  I'm not worried as much about
CJ(K) as you are. However, we still have a long way to go
to (fully) support complex scripts of South Asia, Southeast Asia,
Southwest Asia (Middle East), Korea (Hangul is a complex
script) and Europe/Africa/North America (yes, Europe!
Latin/Greek/Cyrillic alphabets are complex, too!!)

Of course several Japanese companies are competing in Input Method
area on Windows.  These companies are researching for better input
methods -- larger and better-tuned dictionaries with newly coined
words and phrases, better grammatical and semantic analyzers,
and so on so on.  I imagine this area is one of areas where Open
Source people cannot compete with commercial softwares by full-time
developer teams.

  As some linguists have observed, the Japanese writing system seems to
offer a number of fascinating opportunities for linguists/computer
programmers to put their mature and immature ideas to the test.

How about Korean?

 In the case of Korean, conversion to Hanja (Chinese characters) is not
such an important issue as in Japan. Simple dictionary-based word and
character look-up appears to be sufficient for most Korean users because
they rarely use Hanja. As for Hangul input (putting aside pre-1933
orthography Korean for the moment), there are two major keyboard layouts
(like qwerty vs dvorak) with a few variants, but the situation has been
stable for more than a decade.   In other words, there doesn't seem to be
much room for innovation because Korean input is not much more complex
than input of Latin/Greek

Re: supporting XIM

2003-03-27 Thread Jungshik Shin

On Thu, 27 Mar 2003, Pablo Saratxaga wrote:

> [I Cc: to gnome-i18n as it concerns mainly the gtk2 input]
>
> On Thu, Mar 27, 2003 at 04:17:58AM -0500, Jungshik Shin wrote:
>
> >   As mentioned before, this is possible in GTK2 applications.
> > Fire up gnome-terminal and right-click in any text input area
> > and you'll get a pop-up menu from which you can choose a gtk2
> > input module a la Windows.
>
> But you are limited to only one "X input method"...
> That is the big problem; it would be much better if it would be possible
> to have *several* X input methods, like in yudit.

  That's not a problem at all because there are Korean, Japanese
and Chinese input modules that can coexist with other input
modules and be switched to and from each other. With them, you
don't need to use XIM.  For instance, imhangul gtk2 input module for
Korean(http://kldp.net/projects/imhangul) is much more powerful than
Ami. I haven't tried Japanese or Chinese gtk2 input module, but judging
from the way imhangul works, it should be possible to write Japanese and
Chinese input modules as powerful as, if not more powerful than, Japanese
and Chinese XIM servers. BTW, this also works *along* with Xkb. So, if
you have the KDE 'keyboard switcher' (which appears to be a simple wrapper
over setxkbmap, and whose functionality setxkbmap itself provides in a
non-KDE environment), you can switch between all gtk2 input modules, XIM
(either Compose or one of the XIM servers) and as many Xkb layouts as you want.

> me (I can only type some accented letters, while with an UTF-8 locale
> and xkb keyboard (trough "X input method") I can type much more.

   You meant 'Compose'(the built-in XIM server) by 'xkb keyboard',
didn't you?

> I never use the built-in input of gtk2, as it is too deficient for
> In particular esperanto accented letters, azeri schwa, and others.

   You can just use Xkb for whatever is easier to type with Xkb than
with gtk2 input modules. You wrote as if there's an inherent limit in
gtk2 input modules, but obviously there isn't.  It only depends on how
well any given module is written and designed.

> But then, I cannot type in japanese...

  There is at least one Japanese gtk2 input module as I wrote above.
You just have to install it because it doesn't come by default with
gnome 2.x.

> Well, I don't always use all of them, as I don't speak all those languages;
> but a lot of people may have needs that cover several input methods,
> for example Korean and Japanese, or Japanese and French (something
> almost impossible to do properly right now, if you have Japanese input
> you lost some accents), or Chinese and accented pinyin...

  With gtk2 input modules, you can have all of them.

> gtk2 input methods for translitering cyrillic or other scripts are
> useful, but not required.
> more useful are the methods to type in transliteration for scripts
> that use sillabaries with a wide range of combination (korean, geez,
> inuit-cree, etc.),

   Well, Korean script is not usually classified as a syllabary
although it could be many different things depending on how you look at
it :-). Anyway, if there's a need for them(transliterating input methods
for Ethiopic, Inuit, Korean, etc), somebody has to write input modules
for them.  Perhaps, taking advantage of what's done in yudit would be
a good idea when writing such an input module.

> But there is still missing the ability to use various XIM input methods
> and switch between them.

  It'd be nice to have that feature, but it's not necessary because
scripts that usually require XIM servers can be and are
supported by gtk2 input modules.

   Jungshik Shin

Re: supporting XIM

2003-03-27 Thread Jungshik Shin

On Wed, 26 Mar 2003, Edward Cherlin wrote:
> KDE has a decent keyboard and IME switcher in the KDE Control
> Module. You can install it on the toolbar and choose your hot
> key combinations from a drop-down menu.

  Thanks for the info. I didn't know KDE has this feature. However,
does it work for switching XIM's as well? It lets me switch among
as many keyboard layouts as I want, but it doesn't look like
it supports switching between XIM's. Hmm. is it time to upgrade my
KDE?

  Anyway, I found gtk2 input module switching very nice and hope many more
gtk2 input modules come standard with popular Linux distros.

  Jungshik

Re: supporting XIM

2003-03-27 Thread Jungshik Shin

On Wed, 26 Mar 2003, Edward H Trager wrote:

> On 25 Mar 2003, H. Peter Anvin wrote:
>
> > Indeed.  It would be nice to at some point in the future be able to
> > edit, for example, Swedish-language document and suddenly decide I
> > need to insert some Japanese text, call up the appropriate input
>
> I second that!  I hope the XFree86, KDE, and Gnome people are reading this
> and thinking about it (especially in light of recent events occuring

  As mentioned before, this is possible in GTK2 applications.
Fire up gnome-terminal and right-click in any text input area
and you'll get a pop-up menu from which you can choose a gtk2
input module a la Windows. Many more gtk2 input modules have
to be written, but at least the framework is there. Besides,
as others wrote, IIIMF is another option although I haven't tried
it.

  Jungshik

Re: Perl script to hunt for malformed/overlong UTF-8 sequences

2003-03-18 Thread Jungshik Shin

Jungshik Shin wrote:

Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for reviewing files with
unknown encoding manually.
 


 It may be a good idea to filter out the 'UTF-8' representation of
surrogate codepoints (0xd800 - 0xdfff) as well. That is, the following
can be added to $utf8malformed

  \xed[\xa0-\xbf][\x80-\xbf]

In addition, non-characters (0xffff and 0xfffe in all planes) may as
well be filtered out.

 \xef\xbf[\xbe-\xbf]|
 [\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]
( and 5 and 6byte ones if you want)
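
 For anyone who wants to try these patterns outside of Markus's Perl
script, here is a rough Python sketch of the same byte-level checks. It
is only an illustration: the names SURROGATE, BMP_NONCHAR, PLANE_NONCHAR
and find_suspect are invented here, and the plane pattern is as
approximate as the one in the message (it does not check that the lead
byte and second byte form a valid pair).

```python
import re

# UTF-8-style encoding of surrogates U+D800..U+DFFF: ED A0..BF 80..BF
SURROGATE = rb"\xed[\xa0-\xbf][\x80-\xbf]"
# Non-characters U+FFFE and U+FFFF in the BMP: EF BF BE..BF
BMP_NONCHAR = rb"\xef\xbf[\xbe-\xbf]"
# Non-characters U+nFFFE and U+nFFFF in the supplementary planes
PLANE_NONCHAR = rb"[\xf0-\xf7][\x8f\x9f\xaf\xbf]\xbf[\xbe-\xbf]"

SUSPECT = re.compile(b"|".join([SURROGATE, BMP_NONCHAR, PLANE_NONCHAR]))

def find_suspect(data):
    """Return (offset, matched bytes) for each suspicious sequence in data."""
    return [(m.start(), m.group()) for m in SUSPECT.finditer(data)]
```

For example, find_suspect(b"ok \xed\xa0\x80") reports the encoded
surrogate at offset 3, while ordinary Hangul text (whose lead byte can
also be ED) is left alone because its second byte is below A0.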





Re: Perl script to hunt for malformed/overlong UTF-8 sequences

2003-03-18 Thread Jungshik Shin

Markus Kuhn wrote:

The attached Perl script prints cuts from all lines in a plaintext file
that contain non-ASCII bytes. With option -m, it looks for malformed and
overlong UTF-8 sequences instead. Useful for reviewing files with
unknown encoding manually.
 

 It may be a good idea to filter out the 'UTF-8' representation of
surrogate codepoints (0xd800 - 0xdfff) as well. That is, the following
can be added to $utf8malformed

  \xed[\xa0-\xbf][\x80-\xbf]

Jungshik





Re: UTF-8 and LaTeX

2003-03-11 Thread Jungshik Shin

Markus Kuhn wrote:

Frank Mittelbach ([EMAIL PROTECTED]) has posted on
2003-01-07 on [EMAIL PROTECTED] the beginnings of a far more
lightweight UTF-8 support for LaTeX within the inputenc framework, which
will hopefully find its way into the next release:
 http://www.latex-project.org/cgi-bin/ltxbugs2html?pr=latex%2F3480

 I'm not sure how far LaTeX can be stretched to support Unicode. It
appears that Lambda based on Omega (http://omega.cse.unsw.edu.au:8080)
is one of the better ways, if not the way, along with true/opentype fonts
and dvi drivers like dvipdfmx (http://project.ktug.or.kr/dvipdfmx), to get
Unicode fully supported.

Jungshik Shin

Re: UTF-8 Editors? (Was XML and tags)

2003-02-22 Thread Jungshik Shin

On Sat, 22 Feb 2003, Roozbeh Pournader wrote:

> On Sat, 22 Feb 2003, Edward H Trager wrote:
>
> > It turns out that the version of vim that I have does indeed work under
> > xterm for an assortment of LTR languages (Indian languages not tested),

  It wouldn't work for Indic scripts because xterm does not support
Indic scripts (although it supports Thai). It's not even clear what
VT100/220 terminal emulators should do for them.

> > but not Arabic (the only RTL language tested)
>
> Arabic is not in vim yet. They are putting it in now that we're talking,
> and there have been a lot of discussions on something called 'cream' that
> is a vim distribution that has included the Arabic patch.

  You meant a standalone GUI vim (e.g. gvim) as opposed to vim running
inside a terminal emulator, didn't you? Without RTL scripts being
supported by the terminal emulator it's running under, I presume
that it'd be very hard to support Arabic in vim.  BTW, there's a port of
GUI-based vim to gtk2 (and pango) which reportedly supports RTL scripts.
See http://www.opensky.ca/gnome-vim/todo.html. The latest patch
is not the one linked there, but you should be able to get it at
http://regexxer.sourceforge.net/vim.

   Jungshik Shin

mutt and ncursesw

2003-02-18 Thread Jungshik Shin

On Tue, 18 Feb 2003, Nikolai Prokoschenko wrote:

> On Tue, Feb 18, 2003 at 03:57:30AM -0500, Glenn Maynard wrote:
> > > mutt from Debian doesn't have any problems at all!
> > Debian has a "mutt-utf8" package that's compiled against ncursesw.
>
> Not quite - it's some kind of additional packages - maybe it includes just
> the updated binary, I don't really know or care - it works!

  Last time I checked, mutt compiled against the ordinary ncurses
(as opposed to ncursesw) does NOT work for characters with East
Asian width of 'full'. You may get an impression that it works
because you use it only for chars. with East Asian width of 'half'.
For CJK, compiling mutt against 'ncursesw' is a must.

  Jungshik

Re: dos2unix and UTF-8 BOM

2003-02-17 Thread Jungshik Shin

On Sun, 16 Feb 2003, Roozbeh Pournader wrote:

> I was thinking about the annoying BOM-like sequence that Windows 2000's
> and XP's Notepads are putting at the beginning of UTF-8 files. The byte
> sequence "EF BB BF" that's invalid as a header/signature in Unix UTF-8.
>
> Shouldn't 'dos2unix' be patched to also remove this sequence?

  That would be useful. However, that doesn't work very well if multiple
files are fed to it (e.g. 'cat a b c | dos2unix'). And, that's why
we all hate UTF-8 BOM ;-).

  How about these?

 Incidentally, it just occurred to me that ftp/ssh clients could offer a
user-configurable option for the automatic removal of the 'UTF-8 BOM' at
the beginning of a text file in UTF-8 when moving files from Windows to
non-Windows platforms (Unix/Unix-like OS and MacOS). The same is true
of Kermit (Frank, are you here?). All those tools can be configured
to translate between three (and nowadays even more?) EOL conventions,
CR, LF and CR+LF, for text files. Then, the automatic removal (and addition
if that's regarded as necessary) of the UTF-8 BOM at platform boundaries
would be just as useful.

   As for web servers, a configurable option can be added to remove
UTF-8 BOM at the beginning of text/* files(they serve). For instance,
it's easy to write a simple module for Apache(used at Unicode.org web
site) to do that.

   The VFAT, NTFS and FAT drivers for Linux could be modified in a
similar way. And, editors like Vim (which automatically detects the EOL
convention used in text files) can do the same.
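
   To illustrate why the BOM has to be removed per file rather than
from a concatenated stream (the 'cat a b c | dos2unix' case above), here
is a small Python sketch. It is not a patch for dos2unix or any of the
tools mentioned; strip_utf8_bom and concat_without_boms are names made
up for this example.

```python
import codecs

def strip_utf8_bom(data):
    """Drop one leading UTF-8 BOM (EF BB BF) from a bytes object, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def concat_without_boms(chunks):
    """Join the contents of several files, stripping the BOM of each one first."""
    return b"".join(strip_utf8_bom(chunk) for chunk in chunks)
```

Once the files have been concatenated, a filter can only see the first
BOM at the start of its input; the ones from the second and third file
end up in the middle of the stream, which is why the stripping has to
happen at the file (or platform) boundary.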

   Jungshik

Re: mp3-tags, zip-archives, tool to convert filenames to UTF

2003-02-17 Thread Jungshik Shin

On Fri, 14 Feb 2003, Jungshik Shin wrote:
> On Fri, 14 Feb 2003, Nikolai Prokoschenko wrote:
>
> > On Fri, Feb 14, 2003 at 07:01:56PM +0100, Helge Hielscher wrote:
> >
> > > 1) I have some mp3-Files with ID3-Tag, most of these files use the
> > > ISO-8859-1 encoding, but some use a russian encoding. Which programms
> > > can display the russian ID3-Tags? I have tried XMMS, but with no
> > > success.
>
>   If you have a mix of mp3 files with id3v1 tag in ISO-8859-1
> and other mp3 files with id3v1 tag in KOI8-R, the only way to display
> both kinds of tags correctly *simultaneously*(in a single xmms
> session) is to convert both tags to UTF-8 and run xmms under UTF-8 locale.

  One problem with this is that most portable mp3 players on the
market can't handle UTF-8 although they support a dozen or more
languages. Consequently, you may have to reconvert id3v1 tags
in your mp3 files if you need to store them in portable
mp3 players. They support multiple languages by assuming that
there's a one-to-one correspondence between languages and
encodings. This is plainly wrong, but there's not much they
can do given that id3v1 tag does not have any means of indicating
which encoding is used and for the vast majority of mp3 files
circulated and made on the net the aforementioned one-to-one mapping
is valid.
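
  As a rough illustration of the conversion being discussed, the Python
sketch below re-encodes one fixed-width id3v1 text field to UTF-8. It
is not taken from xmms or any tagging tool; the function name
id3v1_field_to_utf8 is invented, and the source encoding has to be
supplied by the caller precisely because the tag itself does not say
which encoding was used.

```python
def id3v1_field_to_utf8(raw, source_encoding):
    """Re-encode one fixed-width id3v1 text field to UTF-8.

    source_encoding is the caller's guess (e.g. "latin-1" or "koi8_r");
    nothing in the tag records it.
    """
    # Fields are padded with NUL bytes or spaces up to their fixed width.
    text = raw.split(b"\x00", 1)[0].rstrip(b" ")
    return text.decode(source_encoding).encode("utf-8")
```

Note that the result can be longer than the original field, since
Cyrillic letters take two bytes each in UTF-8 while id3v1 fields have a
fixed width, so some titles will not fit back into the tag.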

> BTW, id3v2 tags don't have this problem.

  We can just hope that id3v2 will be widely used soon and
a new generation of portable mp3 players will support it.

  BTW, a number of PDAs, mobile phones and other devices
might share the problem arising from the misguided assumption that
languages/scripts and encodings are tightly bound to each other (the
same is true of stupid web mail services like Hotmail, Yahoo mail,
etc). Hopefully, wider use of Linux in those devices and better
UTF-8 support in Linux will change the situation.

  Jungshik

Re: redhat 8.0 - using locales

2003-01-11 Thread Jungshik Shin

On Fri, 10 Jan 2003, Markus Kuhn wrote:

> strongly prefered that locale names do not use a country name at all,
> unless it is necessary to distinguish between countries. The only excuse
> to do so is usually the currency field, which nobody uses anyway and

  LC_COLLATE is sometimes region/country dependent. For instance,
ko_KP and ko_KR have different collation rules (although I wish
there were a common set of rules shared by ko_KR and ko_KP).
In addition, differences between zh_* in LC_MESSAGES are not
trivial.

  Jungshik

Re: hanzi vs kanji

2003-01-04 Thread Jungshik Shin

On Fri, 3 Jan 2003, Maiorana, Jason wrote:

> >Can we please maintain the distinctions between
> >1. language,
> >2. script, and
> >3. typeface 'category' or other typeface differences.
>
> Thats really the question: Is the difference between
> Hanzi and Kanji more one of typeface or of script.

> I would argue that it is a real script difference,

   I strongly disagree with you on this point.
Most people on the Unicode list would agree with
me. If they're different scripts, CJK Unification
should be overthrown right away.

> but it is typically implemented as a typeface
> difference. A character in these scripts do have
> a precise set of radicals, stroke order, and
> proportion.

   This is only the case if you regard anything
other than what the Japanese MoES (Min. of Education and Science) standardized
as 'non-Japanese'.  My grandfather, my father and I (Koreans) could write
a single Chinese character with different stroke counts and sometimes
even different-looking radicals, but all of us know what we mean.

> (Stylization is something applied
> afterwards, deviating from the script norm.)

   Who has the final say in the script norm?
I don't want Korean MoE(Min. of Education)
to tell me to change the way I write
some Chinese characters. My grandfather would
get enraged if  some ignorant bureaucrats
in Seoul wanted him to change the way
he writes.

> It is certainly possible for some to overcome this
> difference, and read their own language despite
> its being in another script, but that does not
> prove that they are identical scripts.

   Neither does it prove that they're different
scripts.

> The difference between fraktur and arial however,
> is purely one of typeface, and seems relatively
> trivial.

   If it's trivial, the diff. across CJK glyph
variants is far far far  more trivial.

   Jungshik Shin

Re: Red Hat 8 now uses UTF-8 by default for all non-CJK users

2003-01-03 Thread Jungshik Shin

On Thu, 2 Jan 2003, seer26 wrote:

>
> > And different glyphs are needed in a document which wishes to show the
> > difference between English and German conventions of the 1920's. Does
> > that mean that Fraktur and Antiqua should have been encoded
> > separately?
>
> Somehow I think the differences are somewhat more significant than that.

   No way!!  The difference between Fraktur and Antiqua is
far more significant than that between 'Chinese' glyphs and 'Japanese'
glyphs.  When I tried to read German newspapers dating from the 1920's,
I had to 'decipher' almost every single letter.  Japanese readers
wouldn't have to cope with that degree of difficulty unless their
'pattern recognition' ability is crippled significantly.  Please, don't
just present 'hearsay', but get hold of ISO 10646-1:2000 (it's about
80CHF) and see with your own eyes how different they are.

> Do you think it is possible to fully represent traditional Chinese and
> Japanese adequately in a single font?

  A single opentype font with multiple glyphs for a single Unicode
character can be used if you're really concerned about the
difference.

> Ive read comments by some Japanese claiming that a large number of the
> kanji in a chinese-oriented font seemed ill-proportioned, even though
> they contained the exact same stroke order (and not in a stylistic
> sense).

Whoever said that, they should dig up their family archives
and try to read old letters/diaries of their grandfathers and
great-grandfathers.  Tell them to see which is more difficult
to read,  their great-grandfathers' handwriting or
Japanese text rendered with a 'Chinese' font.

Jungshik

Re: Japanese Input under RH8

2002-12-13 Thread Jungshik Shin

On Fri, 13 Dec 2002, Mike FABIAN wrote:
> "Jim Z" <[EMAIL PROTECTED]> wrote:
>
> > I tried your tip to bring up kinput2
> I.e. you tried
>
>  export XMODIFIERS="@im=kinput2"
>  LANG=ja_JP LC_ALL=ja_JP kinput2 -xim -kinput -canna &
>  LANG=en_US.UTF-8 LC_CTYPE=ja_JP.UTF-8 program...

  I thought you had written that the following also works with
a new kinput2 (suppose LC_CTYPE/LC_ALL is not defined.) and that
might have been what Jim tried.

   export XMODIFIERS="@im=kinput2"
   LANG=ja_JP.UTF-8 kinput2 -xim -kinput -canna
   LANG=ja_JP.UTF-8 program-where...

Actually, I've just tried it and it worked.

  Jungshik

Re: [Fonts]Re: Xprint

2002-12-11 Thread Jungshik Shin

On 11 Dec 2002, Juliusz Chroboczek wrote:

> Sorry for mis-reading your mail, then.

  No problem :-)

> JS>   As for complex script rendering, it's possible...

> You'll doubtless agree with me that what you're describing are a
...
> for decades now -- it's high time to move on.

  Yes, I agree with you, but somebody needs to do the work.
Actually, the most difficult part may not be programming but
getting/making some intelligent fonts (opentype or AAT) for complex
scripts. For Indic scripts, things are going pretty well and the number
of freely available opentype fonts for Indic scripts is increasing. For
Korean, it's not so good, as I wrote before. I have yet to see a single
free opentype font.

  BTW, you'll be surprised to read comments made by some people at
. They want
to kill PS module in mozilla in favor of Xprint.

> JC> I'm a little bit suspicious about their choice to use Type 42 CIDFonts

> JS> Given that truetype fonts are much easier to come by than genuine
> JS> CID-keyed fonts for CJK (which is also true of truetype fonts vs PS
> JS> type 1 fonts for European scripts although to a lesser degree), I guess
> JS> the choice is all but inevitable...
>
> I may have misunderstood something, but last time I checked the
> approach was to use Type 42 CIDFonts *only*.  These are currently a
> fairly rare beast (only supported since version 3012, if memory serves).

 I also thought that's the case. However, Brian Stell changed the plan
(see http://bugzilla.mozilla.org/show_bug.cgi?id=144663. ) and he's now
gonna use type 8 (neither type 11=what you're calling type42 CIDFont =
CIDFont type2 nor type 42). What's type 8 font, btw?

> JC> [using Type 42 CIDFonts] will require many users to rasterise
> JC> everything with ghostscript on the host, with all the ensuing
> JC> performance and printing quality issues.

Because you wrote the above, I thought that you had reservations about
doing everything on the host side and regarding printers as dumb devices,
which may sacrifice the printing quality. I also thought that you prefer to
leave as much as possible for PS printers to take care of. That's why I
didn't even mention the most certain way to produce portable PS output
(type3 bitmap) and I wrote about the percentage of end-users owning
PS printers.

> Conversion to Type 1 fonts works everywhere, gives excellent results,
> and the code is readily available (ttf2pt1).  Finally, if everything

   Does this conversion code also work for large CJK ttf fonts(with more
than 256 glyphs)? Or, does it also support conversion to composite
font(OCF?)?

> As you see, I am not arguing against support for CIDFonts; I'm merely
> stating that making Type 42 CIDFonts the only download format for TTFs
> makes me er... suspicious.

  I'm not against producing portable PS, either :-).  However,
I think the portability of PS output doesn't matter much considering
the way printing is handled these days in Unix/Linux.

   Jungshik

Re: [Fonts]Re: Xprint

2002-12-10 Thread Jungshik Shin


On 10 Dec 2002, Juliusz Chroboczek wrote:

JS>   Even with this weakness, Xprint is by far the best printing
JS> solution available at the moment for Mozilla under Unix/X11
JS> because postscript printing module of Mozilla does not work very
JS> well yet

JC> Xprint might work for CJK fonts,

  It does work for CJK now. Especially version 0.8 of Xprint with
truetype font support works pretty well. Even the PS output
produced by 0.7 with X11 bitmap fonts doesn't look that bad.

JC> although I'm a little bit suprised at  your enthusiasm for the thing.

  I'm not as enthusiastic about it as you may think. A better
word to characterize what I think about it is
ambivalence.  See my postings to mozilla-i18n newsgroup
. When I wrote
'by far the best', I meant _as of now_ it gives the best match between
the print out and the screen rendering. For CJK web pages, Mozilla PS
module can't do that because only *one* PS font for each language can be
specified. That is, on the screen, Mozilla(especially Mozilla-Xft) can
be a  good implementation of CSS, but on the print out, it cannot.
Xprint is not perfect, but it's better than printing out everything (CJK
and non-Western European) in a single font (specified in a pref. file
which has to be hand-edited by end-users). Besides, complex scripts
cannot be printed out at all by Mozilla under Unix without Xprint.
With Xprint, it's possible to print out web pages in complex scripts
provided that you can render them on the screen with Mozilla-X11core.
That's a big difference.

JC> There is no way, though, how Xprint
JC> could work for complex scripts without standardising on glyph
JC> mappings.

  As I understand it, Xprint is a specialized form of X11 server
combined with some X clients. Therefore, I think it has all sorts of
weaknesses found in the server-side font model we have been moving away from.
It's neither fast nor efficient (compared with client-side font technology)
and it doesn't support 'modern' CSS-based font selection/resolution at
the same level as provided by fontconfig. Nonetheless, it works _now_.

  As for complex script rendering, it's possible to print them out
as I wrote above and my test with Old Korean showed. (see
 http://bugzilla.mozilla.org/show_bug.cgi?id=176315). Standardizing
on glyph mapping is not a requirement if we just deal with a single
application program(e.g. Mozilla). Mozilla-X11 has a way to map the last
two fields of XLFD to a  mapping between a string of Unicode characters
and a sequence of glyphs. That's what Mozilla-X11 uses to render Indic
scripts, Thai and Hangul Conjoining Jamos. (Mozilla doesn't yet support
opentype fonts at least under X11. Some Pango code was borrowed but
that's not from pango-xft but from pango-x). Because Xprint module of
Mozilla shares many things with Mozilla-X11corefont/Mozilla-Gtk, without
doing anything, Xprint just works when it comes to printing out web pages
in Indic scripts, Thai and Old Korean.

  Of course, I'm well aware that we have to use opentype fonts with
gsub/gpos tables for complex script rendering.  However, we also need a
short-term solution that works now.  For instance, there is not a single
opentype font freely available for old Korean. The situation is much
worse than that for Indic scripts for which free opentype fonts began
to emerge. In the meantime, we have to resort to font-specific-encoding
hacks.

JC> There is also no way[1] how Xprint could implement
JC> dynamically generated fonts, as required for example by CSS2.

 I'm a bit confused as to what you meant by 'dynamically generated
fonts'. Did you mean 'web fonts'?  Can you tell me what you meant?

JC> The right approach is obviously to do incremental uploading of fonts
JC> to the printer at the PS level, as the Mozilla folks are trying to do.

  I totally agree with you provided that the font resolution mechanism
is tied with fontconfig.

JC> I'm a little bit suspicious about their choice to use Type 42 CIDFonts

  Given that truetype fonts are much easier to come by than genuine
CID-keyed fonts for CJK (which is also true of truetype fonts vs PS
type 1 fonts for European scripts although to a lesser degree), I guess
the choice is all but inevitable(perhaps OpenOffice also adopted this
approach). Do you have a better idea?  Judging from your reservation about
the rasterization on the host side, what you're thinking of cannot be
converting all the glyphs into bitmap and putting them in the PS output.
Anyway, I believe this 'mini-project' for Mozilla printing has to be 'glued'
with fontconfig in CSS2 font resolution so that the screen rendering
and PS output use the same set of fonts.

What I can think of as an alternative to embedding type 42 PS font(type
2 CIDFont) is just to refer to CID-keyed fonts/type 1 fonts in the
PS output and let a real PS printer or ghostscript do the rest of the
job. This is similar to what the present PS module for Mozilla does.
However, in order to get

RE: mixing LANG and LC_CTYPE

2002-12-10 Thread Jungshik Shin

On Tue, 10 Dec 2002, Maiorana, Jason wrote:

> >> If this is not the case, is there any locale which will correctly
> >> ctype() all of unicode?
> >
> >  There's NO single 'correct' way although there can be a 'generic'
> >isupper, islower, toupper, tolower and so forth work differently
> >on the language/region of the locale.
>
> The unicode standard itself seems to provide standard mappings of
> upper, lower, and title case. The locale system does not seem to

  Unicode standard does  provide the *default*, but that default can be
tailored and overridden depending on language/locale/region.
That is, what's correct for English may not be correct
for Turkish, Irish, Swedish, Dutch, Russian and Bulgarian
however minor those differences might be. That's what I meant when
I wrote that there is not 'the' correct way.
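
As a rough illustration of tailoring, assuming Python 3, whose str.upper()
applies only the untailored Unicode default mapping. The helper name
upper_turkish is hypothetical and covers just the dotted/dotless i case:

```python
def upper_default(text: str) -> str:
    # Untailored Unicode default case mapping.
    return text.upper()


def upper_turkish(text: str) -> str:
    # Hypothetical Turkish tailoring: 'i' maps to U+0130 (capital I with dot
    # above) and U+0131 (dotless i) maps to 'I'. Everything else falls back
    # to the default mapping.
    tailored = text.replace("i", "\u0130").replace("\u0131", "I")
    return tailored.upper()


print(upper_default("istanbul"))
print(upper_turkish("istanbul"))
```

The same input gets a different, equally 'correct' uppercase form depending
on which language's rules are applied.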

> but I dont see why an "isspace" function couldnt
> work correctly for all of unicode/all languages.

  I also think that *some* categories in LC_CTYPE
appear to be language-neutral, but I can't be 100% sure. You never know.

> The main point I'm getting at, is that even if I'm in en_US.UTF-8,
> why cannot the upper/lower converter make an effort for the
> other languages, such as vietnamese, which have obvious case
> conversions to any roman-alphabet user.

  I haven't disputed and won't dispute this point. I totally
agree with you on this point. I want en_US.UTF-8 or any ll_CC.UTF-8 to
work reasonably well for the full repertoire of Unicode.  That's exactly
what Unicode is for among other things.  However, you cannot assume that
what's correct for English as used in US is also correct for French as
used in Canada, and other lang/scripts/region combination.

> Duplicating the full case conversion tables for all installed
> locales does seem a bit redundant... Instead maybe a small file
> like:

  No doubt there should be an efficient way to share what's common
across lang/region/script and store only the 'tailoring delta'
separately for each lang/region/script.  Well, someone might say that
disk is cheap.

  Jungshik


Re: mixing LANG and LC_CTYPE

2002-12-10 Thread Jungshik Shin

On Tue, 10 Dec 2002, Pablo Saratxaga wrote:

> On Tue, Dec 10, 2002 at 10:37:56AM -0500, Noah Levitt wrote:

> > Should a combination like LANG=fr_FR LC_CTYPE=en_US.UTF-8
> > result in something equivalent to LANG=fr_FR.UTF-8?
>
> Isn't what you are looking for:
>
> LC_ALL=en_US.UTF-8
> LANGUAGE=fr
>
> that is, all locale stuff defined to en_US.UTF-8, and French translations
> if available.

  Oh, no.. I hate non-standard LANGUAGE and LINGUA and their friends.
IMHO, they should have never been introduced.  Why can't we just
use LC_MESSAGES?   In his case, he can use 'LANG=en_US.UTF-8 and
LC_MESSAGES=fr_FR.UTF-8' if that's what he's looking for.

  Jungshik Shin


RE: mixing LANG and LC_CTYPE

2002-12-10 Thread Jungshik Shin

On Tue, 10 Dec 2002, Maiorana, Jason wrote:

> >> Should a combination like LANG=fr_FR LC_CTYPE=en_US.UTF-8
> >> result in something equivalent to LANG=fr_FR.UTF-8?

  Even in theory, no if there are differences between French and
English in character classification, 'case conversion' and so forth.
Why don't you just use 'LANG=fr_FR.UTF-8' if that's what you want?

> what about
> LANG=fr_FR.UTF-8
> LC_CTYPE=en_US.UTF-8
> ?

  Nothing wrong with this. All LC_*'s other than LC_CTYPE would
follow LANG, but is that what he wants?

> for UTF-8, the ctype information would be the same, right?
> (case, whitespace, etc )

  No, they're language-region dependent.

> If this is not the case, is there any locale which will correctly
> ctype() all of unicode?

  There's NO single 'correct' way although there can be a 'generic' default.
isupper, islower, toupper, tolower and so forth work differently depending
on the language/region of the locale.

> When programming, I avoid the ctype function itself. I think its
> better to convert to utf-8 on input (if its not already) and
> use generic unicode ctype functions.

   The keyword here is 'generic' and is not applicable to all languages
all the time.

  Jungshik


Re: cxterm cut/paste: COMPOUND_TEXT, UTF8_STRING?

2002-12-09 Thread Jungshik Shin

On Mon, 9 Dec 2002, Tony Laszlo wrote:

Hi,

> I found this 1999 post in the mozilla-i18n archives from Jungshik.
> http://www.geocrawler.com/archives/3/113/1999/7/150/2441628/
>
> I seem to be having a similar issue, at the moment, with Chinese
> copied from cxterm and pasted into Mozilla (or yudit, or an mlterm
>  window). RH7.1, latest Mozilla, latest yudit, kde.

  As I wrote there, cxterm and hanterm are to blame because
they violate X11 ICCCM.  Mozilla, yudit, mlterm and kde are doing just
what they're supposed to do. (I mentioned a work-around that may be
implemented by 'programs on the receiving end' in my posting, but I
think that's not a good idea.) Mozilla has since implemented UTF8_STRING.
'The' way to solve this problem is to fix cxterm and hanterm to support
UTF8_STRING and COMPOUND_TEXT. kterm(Kanji term) and rxvt(cjk) support
COMPOUND_TEXT and  mlterm and xterm(XFree86)  support both.

  Jungshik


RE: UTF-8 wakeup call

2002-12-07 Thread Jungshik Shin

On Sat, 7 Dec 2002, Kent Karlsson wrote:

> > The mappings used are at least also from the RFC 1345 (recode uses that)
> > or the IS 15897 which uses many of the same names and mappings.
> > Specifically I have seen that Linux is *not* using the Unicode data
> > because of copyright issues.
>
> Hmmm.  From http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html:
>
>   Limitations on Rights to Redistribute This Data
>
>   Recipient is granted the right to make copies in any
>   form for internal distribution and to freely use the
>
> I don't see this as restrictive for use in Linux.  I'm sure Unicode
> consortium would like to see its data being used also in open source

   glibc 2.x may not use them, yet. However, glib(and other libraries
built on top of it) indeed makes an extensive use of Unicode data files.
So do Perl, Yudit, Mozilla and other free/opensource programs/projects
that run on Linux.

  Jungshik


Re: Input under RH8

2002-12-07 Thread Jungshik Shin

On Fri, 6 Dec 2002, Maiorana, Jason wrote:

> First, thanks to Jungshik Shin & Mike FABIAN for your
> replies.

 You're welcome :-)

> I surmise that the current state of RH8 is that it is not
> yet suitable for entry of all languages simultaneously.
> (flaws in XIM itself being part of the problem)

 You're right. You can't do MS Windows/MacOS style IME
switching, yet, in all applications.

> I can probably setup some scripts to pop up a gedit in a
> given mode, but, with the exception of VIQR and Korean,
> I cannot yet graphically switch around to any input method
> with the version of gtk2 that comes with rh8.

   Gtk2 as shipped in RH8 has Thai(broken?), Tamil,
Cyrillic(transliterated), Inuktitut, IPA, Tigrigna-Ethiopian,
Tigrigna-Eritrean,  and Amharic input modules in addition to XIM,
Vietnamese, *broken* Korean(KSC5601) input module. For Korean, you'd
better install 'imhangul' input module at http://imhangul.kldp.net. You
can download the source by clicking 'download' in red and install it by
following the instruction in the gray box below the link for download.
If this is the first time you install 'imhangul', you have to run 'make
install' twice (it's due to a bug to be fixed.)

  You can also make use of Xkb. With its support of multiple
levels, you can add yet another 'input method' to your repertoire of input
methods accessible in gedit(a gtk2 application). As for Xkb, refer to
XFree86 I18N archive.

> Hopefully, in the near future, RH will ship all utf-8
> locales by default, and gtk2 will have a XIM wrapper
> that allows access to any input method on the system
> from any language locale.

  Alternatively, 'meta XIM server' (as implemented at the client level
by Yudit and mlterm) that lets users switch between multiple XIMs will
be handy. Then, it can be used for non-gtk2 applications as well as
gtk2 applications.

 BTW, has anybody heard of gtk2 input modules for Chinese and Japanese?
A quick googling didn't turn up anything.

   Jungshik


RE: Japanese Input under RH8

2002-12-06 Thread Jungshik Shin

On Fri, 6 Dec 2002, Jungshik Shin wrote:
> On Fri, 6 Dec 2002, Maiorana, Jason wrote:
> > im curious why I would set the LC_CTYPE to ja_JP.UTF-8,
> > why would that be any different than en_US.UTF-8 when the
> > LANG is en_US.UTF-8. I'm not worried about japanese collation

>   Unfortunately, most XIM servers are written in such a way
> that they can only be launched under a certain locale.  However,

  BTW,  I didn't mean that kinput2, Xcin and Ami cannot
be modified to work under en_US.UTF-8 locale. They can, but their
dependency on fontset makes them work less optimally than under their
'native' locales. I guess we  have to give up 'stretching' old XIM
protocol and had better focus on a new IIIMF(Internet Intranet Input
Method Framework: http://www.openi18n.org/subgroups/im/IIIMF.
Li18Nux.org changed the name to become OpenI18N.org) or gtk2 input modules
or similar mechanisms. MS Windows has something called TSF(Text Service
Framework) which appears to be very flexible. IMHO, XIM is too old to
be on par with the likes of TSF. IIIMF is in a far better position for that
than XIM.

  Jungshik


Re: Why doesn't Linux display Japanese file names encoded in UTF-8?

2002-12-06 Thread Jungshik Shin

On Fri, 6 Dec 2002, Jim Z wrote:

Jim,

> However, there are issues. After those changes
> when I logged into Japanese EUC locale, everything
> is displayed in English. :( So was for Japanese
> UTF-8 locale. Is that because the system couldn't
> find the resources?

 Have you checked what's in /etc/sysconfig/i18n and ~/.i18n?
Why don't you make both of them clean and see what you get?
Also make sure that you installed kde-i18n-Japanese package
for KDE?  In my case, both Gnome and KDE came up nicely in
Japanese.

> I didn't check and made sure
> that the locale.dir was modified (I'll check again).
> Also, in UTF-8 for Japanese mode, there is no
> Japanese input (Shift-space bar).

 As already noted by others, kinput2 has to be launched under
ja_JP.EUC-JP. Certainly, this has to be fixed.

> In general, looks like UTF-8 works on Linux for CJK;

 There are still some issues (input methods as you found,
localized man pages).  Localized man pages are mostly in legacy encodings
and it's hard to figure out how to make them work in UTF-8 locale(if
at all possible). 'man', 'less' and 'groff' all do things differently
(when it comes to interpreting LC_* and LANG environment variables) and
they interact with each other in an intricate way. At least, I think 'man'
has to be fixed to either call setlocale(LC_MESSAGES,...) directly or
to use the SUS-provisioned order of resolving LC_*/LANG env. variables.
(i.e. 1. LC_ALL 2. LC_* 3. LANG)  At the moment, even 'LC_ALL=C man
xyz' doesn't give me man pages in English, let alone 'LC_MESSAGES=C'
when LANG is set to ko_KR.UTF-8.  Note that LANG should be given the
lowest precedence in the locale resolution and LC_ALL should be at the
top. Certainly, man doesn't honor that order.
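
The resolution order described above can be sketched in Python as follows.
This is only an illustration under stated assumptions: effective_locale is a
hypothetical helper, the environment is passed in as a plain dict, and the
fallback to "C" when nothing is set is an assumption (the standard leaves
that default to the implementation).

```python
def effective_locale(env, category):
    """Return the locale name in effect for one category, e.g. 'LC_MESSAGES'."""
    # Precedence: LC_ALL first, then the category's own variable, then LANG.
    # A variable that is unset or set to the empty string is skipped.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # assumption: fall back to the C/POSIX locale


env = {"LANG": "ko_KR.UTF-8", "LC_MESSAGES": "C"}
print(effective_locale(env, "LC_MESSAGES"))  # category variable beats LANG
print(effective_locale(env, "LC_CTYPE"))     # nothing more specific, so LANG
```

A 'man' that followed this order would show English pages for
'LC_MESSAGES=C' even when LANG is ko_KR.UTF-8.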

  A couple of years ago, we discussed how to tag(if we decide
to tag them) the encoding used in man pages, but it got nowhere. A
reasonable approach appears to be to convert them all to UTF-8 (assuming
groff UTF-8 support will come along soon).

> however, there is no way for general users to do what
> they intend to do.

  According to what I heard on this list, SuSe 9.1
offers UTF-8 locales for all languages as an alternative to traditional
encodings so that SuSe users should have no problem there.
Mandrake 9.0 seems to do it, but it doesn't work out of box
(I had to make some modifications) as far as I can tell.

> Your help is appreciated and I would like to see your
> fixes get into near future builds so all can benefit.

  My changes to XFree86 have gotten into CVS of XFree86 so that
I guess it'll be included in upcoming 4.3.0 release. With increasing
use of Xft/fontconfig and client-side fonts, the importance of
my patch(to X11 locale) will diminish.

  Jungshik Shin


RE: Japanese Input under RH8

2002-12-06 Thread Jungshik Shin

On Fri, 6 Dec 2002, Maiorana, Jason wrote:

> thanks for the tips, but what I really wanted was use japanese/other
> languages
> input methods, but not be in a ja_JP locale. (just the default local
> en_US.UTF-8)
> (Also I was hoping it could be done in an application that was already
> running,
> for example I would start off in VIQR, then maybe do some korean input,
> then
> switch to XIM/kinput2/canna, all in the original gedit window...)

  You're talking about two different things here. One is XIM
and the other is gtk2 input modules. Gtk2 input module mechanism (that
you bring up by 'right-clicking' in gtk2 input widget area) lets you do
what you want. It also supports XIM as one of supported 'modules'. Under
en_US.UTF-8 locale, XIM selected is (unless XMODIFIERS is set to
@im..)  the default built-in XIM which is Compose mechanism. Compose
mechanism is pretty powerful for alphabetic scripts although it's not
so useful for Japanese and Chinese.

> im curious why I would set the LC_CTYPE to ja_JP.UTF-8,
> why would that be any different than en_US.UTF-8 when the
> LANG is en_US.UTF-8. I'm not worried about japanese collation
> i'd prefer to use a default "unicode collation".

  Unfortunately, most XIM servers are written in such a way
that they can only be launched under a certain locale.  However,
gtk2 input module mechanism can be used to achieve what you want(
switching between any number of different input modules in any UTF-8
locale). Somebody has to write (a) gtk2 input module(s) for Japanese
(if it hasn't been written yet. There are a very powerful set of Korean
input modules for gtk2 all based on U+1100 Hangul Jamos alone) Then, you
can use it regardless of the locale you're in. This is great as long as
you use gtk2 applications. For non-gtk2 applications, it doesn't work,
though and there's still a need to write a 'wrapper XIM' server that
lets users invoke multiple XIM servers at will. There are a couple of
projects going on in that direction. There's also a 'next generation input
protocol' for X11 and other platforms. (look around http://www.li18nux.org).
You can find more details in XFree86 I18N mailing list archive.

> Im curious, why do you suggest that kinput2 should be run with
> eucJP as its startup encoding? Does it have bugs if that is not the
> case?

  I guess kinput2 was written that way. That was also the case of
Korean input method Ami without my patch. Because Ami launched under
ko_KR.EUC-KR can't be used to input the full repertoire of Hangul
syllables in Unicode, I patched it to be launchable under
ko_KR.UTF-8 locale.

  Jungshik


Re: filename and normalization

2002-12-05 Thread Jungshik Shin

On Wed, 4 Dec 2002, Werner LEMBERG wrote:

> > > the manpage was not using a regular ascii '-', but instead one of
> > > the HYPHEN, or EM_DASH things (Which is why i HATE them).
> >
>
> > you can configure the way your 'man' works in man.config.  You can
> > set NROFF to use '-Tascii -man' and you get 'ASCII approximation' of
> > real em_dash, hyphen etc so that you can copy and paste and search

> A better temporary solution is to add the following to man.local:
>
>   .if '\*[.T]'utf8' \
>   .  char \- \N'45'

  Thanks. It worked great. Neither of Mandrake 9 and RH 8 has this
in man.local. I guess they should.

  Jungshik


Re: filename and normalization (was gcc identifiers)

2002-12-05 Thread Jungshik Shin

On Wed, 4 Dec 2002, seer26 wrote:

> > Filenames are _names_. They are names for _human_ use; computers would
> > be just as happy passing about inode numbers. Humans don't like
> > dealing with strings of bytes. They like dealing with strings of
>
> I think we both agree that different file names should look different.

  I guess there's a  difference of opinions on what makes filenames
different between you and David.

> The question is not whether normalization should be done, but where. My
> argument was that it should not be done inside the kernel, filesystem,
> compiler, linker, etc. But instead it should be dealt with at the Input
> Method, and user interface level.

 As for  linker, you're  assuming that you always
work _alone_ with a single compiler, a single programming language  and
a single editor.

> (Normalized strings would always be generated by user input, and
> non-normal strings would be displayed as escape sequences.)

  What's your definition of normalized string and non-normalized
string?  If you're talking about overlong/invalid UTF-8 sequence(or
invalid in the present encoding), what you said makes sense. Otherwise,
it doesn't.  Why would they( strings in one NF and strings in other NF)
be treated differently by *UI*? They're equally valid as representations
of strings of _characters_. Your view that they have to be treated
differently is not consistent with your view that UI/input method
are  places where normalization should occur. If UI does that, your
'non-normalized' string should be treated by UI the same way as your
'normalized' string is.  That's what 'normalization' is for. It's not a
one-way street(from user input to system) but a two-way street.
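
A minimal sketch of that two-way treatment at the UI level, assuming Python 3
and UTF-8 filenames handed over as raw bytes. Both helper names are
hypothetical. Only byte sequences that are invalid (or overlong) UTF-8 are
shown as escapes; names in any normalization form are displayed as text and
compared by canonical equivalence.

```python
import unicodedata


def display_name(raw: bytes) -> str:
    # Valid UTF-8 decodes as-is; invalid or overlong bytes become \xNN escapes.
    return raw.decode("utf-8", errors="backslashreplace")


def same_for_user(a: bytes, b: bytes) -> bool:
    # Canonically equivalent names compare equal once both are normalized
    # to the same form. Names that are not valid UTF-8 are compared as bytes.
    try:
        text_a, text_b = a.decode("utf-8"), b.decode("utf-8")
    except UnicodeDecodeError:
        return a == b
    return unicodedata.normalize("NFC", text_a) == unicodedata.normalize("NFC", text_b)


nfc = "caf\u00e9".encode("utf-8")    # precomposed e-acute
nfd = "cafe\u0301".encode("utf-8")   # 'e' followed by combining acute
print(nfc == nfd, same_for_user(nfc, nfd))
print(display_name(b"\xc0\xaf"))     # overlong sequence, shown as escapes
```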

  For sure, some advanced users may want to examine 'binary contents'
of filenames. That should be provided as optional features of UI.

  Jungshik


Re: filename and normalization (was gcc identifiers)

2002-12-05 Thread Jungshik Shin

On Wed, 4 Dec 2002, seer26 wrote:

> > is to insist that  11,172 modern precomposed syllables be encoded
> > in Unicode/10646. Next biggest blunder they made is to encode tens
> > of totally unnecessary cluster-Jamos when only 17+11+17+ a few more
> > would have been more than sufficient. Next stupid thing they did is

> Would Chinese be in a similar situation if the radicals were
> combining characters, and any combination of them could in theory be
> a valid character?

  Possibly. However, radicals are only a small subset of 'components'
used in Chinese characters. You need to have a lot more 'components'
than radicals listed in any Chinese character dictionary.

> In practice, of course, a normal person would use
> far fewer than 10,000 distinct characters.

  Do you think anybody  wants a character set standard(like
Unicode) to specify the list of sequences of Latin/Greek/Cyrillic
alphabets that are allowed? Imagine  that you can use 'ab, eb, ob, se,
ce' but cannot use 'sce, gh, ph' That's what encoding a fixed set of
precomposed  syllables does for Korean alphabet.

> Have you ever needed a character that wasnt among the 11,172 precomposed
> ones?

  Sure! See <http://jshin.net/i18n/korean/hunmin.html>
or <http://jshin.net/i18n/uyeo.html>. 11,172 precomposed syllables don't
include any pre-1933 orthography syllables.  The set doesn't include
modern incomplete syllables(which high school Korean teachers need to
teach Korean grammar), either. Basically, it was a very stupid idea
(and a vast waste of codespace) to enumerate possible combinations of
alphabetic letters.  Just encoding alphabetic letters should be more than
enough. I wish Korean Nat'l Standard body had been half as competent
as its counterpart in India. ISCII (which ISO 10646/Unicode copied almost
verbatim) did a great job of encoding only what's absolutely necessary for
Indic scripts. And, that was in early 1990's when no intelligent modern
rendering engine and font were in sight. They, however, had a foresight
that encoding hundreds or thousands of 'presentation forms' for each
of Indic scripts was not a way to go and that eventually intelligent
and advanced fonts/rendering engine would come out. They were right and
nowadays Indic scripts are pretty well supported by Pango, Uniscribe,
ATSUI, and Graphite. It may take a little more while to have opentype
fonts in public domains for all Indic scripts, but they're coming...
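
The gap in the precomposed set can be checked with a short Python sketch,
assuming the standard unicodedata module. The helper composes_to_one is
hypothetical, and the two jamo sequences are examples chosen here, not taken
from the pages linked above.

```python
import unicodedata


def composes_to_one(jamos: str) -> bool:
    # True if NFC turns the conjoining-jamo sequence into a single
    # precomposed syllable code point.
    return len(unicodedata.normalize("NFC", jamos)) == 1


modern = "\u1112\u1161\u11ab"  # HIEUH + A + NIEUN
old = "\u1112\u119e\u11ab"     # HIEUH + ARAEA + NIEUN (pre-1933 vowel)

print(composes_to_one(modern))  # has a precomposed form
print(composes_to_one(old))     # stays a jamo sequence
```

The old-orthography syllable has no precomposed code point, so it can only
be represented as a sequence of Jamos.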

  Jungshik Shin


Re: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin

On 4 Dec 2002, H. Peter Anvin wrote:

> By author:    Jungshik Shin <[EMAIL PROTECTED]>

> >  Whether you're convinced or not, it's not only in Unicode but also
> > inscribed in ISO 10646.

> Standards change.  "Forever" is a very long time.

  Where did I use the word? 'Inscribed'? Standards change, but they
also get made obsolete by new standards. ISO 10646 someday may be replaced by MGSO
(Milkyway Galactic Organization for Standardization) 2xqwew12343.

  Jungshik Shin


RE: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin

On Wed, 4 Dec 2002, Maiorana, Jason wrote:

> >> For that reason, I dont like form D at all.  I wonder how much space
> >> it would take to represent every possible Jamo-combination, then just
> >> do away with combining characters altogether...
> >  No way!!  The biggest blunder ever made by Korean nat'l standard body
> >is to insist that  11,172 modern precomposed syllables be encoded
> >in Unicode/10646. Next biggest blunder they made is to encode tens
..
> >available in 20.1 bit coded character set which is ISO 10646/Unicode.
>
> Wow, ok, I guess that idea wont work for Korean.
> Also, since glyph swapping has to be done for merely adjacent
> characters,
> doing it for combining ones must be a relatively minor concern.
>
> Out of curiosity, how many of those Korean letters are actually
> made use of by the language? 1.5 million sounds higher than any
> number of phoneme's that a human can produce

   Needless to say, modern Korean speakers can pronounce only
a very very small fraction and chances are that the number will decrease
as time goes by because as in most other languages, speakers are on the
winning side of the battle between listeners and speakers.  You have to
understand that Korean Hangul is alphabetic and the number of possible
syllables that can be made out of a finite set of alphabetic letters is
infinite whether it's Latin, Greek, Cyrillic, Indic or Korean.

> (what if the cluster jamo's were dropped?)

   It doesn't make any difference at all. Cluster Jamos can be
represented as well by a sequence of basic Jamos.  Please, note that
the most generic form of Hangul sequence is given as

   L+V+T*M?

where L, V, T, and M denote leading consonant, vowel, trailing
consonant and combining mark (for Hangul, it's most likely to be
one of two tone marks), and '+', '*', '?' have their usual meanings
in RE.
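
The pattern above can be written out as an actual regular expression. This is
a sketch in Python under stated assumptions: the character ranges are the
leading, vowel and trailing parts of the Hangul Jamo block (U+1100..U+11FF),
the tone marks are taken to be U+302E and U+302F, and is_hangul_syllable is a
hypothetical helper name.

```python
import re

L = "[\u1100-\u115F]"   # leading consonants (including the filler)
V = "[\u1160-\u11A7]"   # vowels (including the filler)
T = "[\u11A8-\u11FF]"   # trailing consonants
M = "[\u302E\u302F]"    # Hangul single-dot and double-dot tone marks

SYLLABLE = re.compile(f"{L}+{V}+{T}*{M}?")


def is_hangul_syllable(text: str) -> bool:
    # True if the whole string is one syllable of the form L+V+T*M?
    return SYLLABLE.fullmatch(text) is not None


print(is_hangul_syllable("\u1112\u1161\u11ab"))        # L V T
print(is_hangul_syllable("\u1107\u1109\u1100\u1161"))  # L L L V, a cluster spelled with basic Jamos
print(is_hangul_syllable("\u1161\u11ab"))              # no leading consonant
```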

That's why I wrote that cluster Jamos shouldn't have been encoded at all.
The same is true of all those 11,172 precomposed syllables. For Korean
Hangul, all we need are a few dozen basic Jamos. I feel 'guilty'
(although I haven't been involved in any way forcing them through)
that Korean Hangul took about a fifth of BMP codespace when about
two hundredth of that is enough.

> Are we heading for a long-run scenario, where Form-D becomes canonical,
> and all the old pre-composed codepoints are deprecated? NF-C seems
> to be getting more and more entrenched from what I can tell...

  Well, from the very beginning, UTC didn't want to have precomposed
forms in Unicode. Precomposed characters are not there because they wanted
to encode them but because they had to maintain 'compatibility' with
legacy coded character sets in which they're encoded as separate entities.
If they had been able to start afresh without any concern for
legacy character sets, there would have been NO precomposed
characters that can be represented by sequences of base characters
and combining characters.

  Jungshik Shin


Re: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin



On 4 Dec 2002, H. Peter Anvin wrote:

> By author:    Jungshik Shin <[EMAIL PROTECTED]>

> > How many? It's __infinite__ in theory. In practice, it could
> > be around 1.5 million.  That's more than the total number of codepoints
> > available in 20.1 bit coded character set which is ISO 10646/Unicode.

> And people give me funny looks when I tell them not to trust the "20.1
> bits forever" statement from Unicode, just as I didn't trust the
> earlier "16 bits forever" statement...

 Whether you're convinced or not, it's not only in Unicode but also
inscribed in ISO 10646.

  Jungshik Shin


RE: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin

On Wed, 4 Dec 2002, Maiorana, Jason wrote:

> If characters are ever introduced which have no precomposed codepoint,
> then it will be difficult for a font to "normalize" them to one
> glyph which has the appropriate internal layout. The font file itself
> would then have to know about composition rules, such as when
> X is composed with Y then Z, then use this glyph XYZ which has no
> single codepoint in unicode.

 Have you ever heard of Opentype and  AAT fonts? Modern font
technologies and modern rendering engines (Pango, AAT, Uniscribe,
Graphite) can all do that. Otherwise, how would Indic scripts be used
at all?  What you describe above is done every day by Pango,
Uniscribe and AAT/ATSUI, Graphite.

> For that reason, I dont like form D at all.  I wonder how much space
> it would take to represent every possible Jamo-combination, then just
> do away with combining characters altogether...

  No way!!  The biggest blunder ever made by Korean nat'l standard body
is to insist that  11,172 modern precomposed syllables be encoded
in Unicode/10646. Next biggest blunder they made is to encode tens
of totally unnecessary cluster-Jamos when only 17+11+17+ a few more
would have been more than sufficient. Next stupid thing they did is
to remove compatibility decomposition between cluster Jamos and basic
Jamo sequences although they should be canonically(not just compatibly)
equivalent.  Now, you're saying that all possible combinations of them
be encoded. How many? It's __infinite__ in theory. In practice, it could
be around 1.5 million.  That's more than the total number of codepoints
available in 20.1 bit coded character set which is ISO 10646/Unicode.
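
The numbers involved can be worked out in a few lines of Python. The counts
19, 21 and 28 are the leading consonants, vowels and trailing slots (27
consonants plus 'none') that Unicode uses to lay out the precomposed
syllables from U+AC00. The 1.5 million figure is the estimate given above,
not something computed here, and the helper name syllable is hypothetical.

```python
import math

LEADING, VOWELS, TRAILING = 19, 21, 28

precomposed = LEADING * VOWELS * TRAILING  # modern precomposed syllables
code_points = 17 * 0x10000                 # 17 planes of 65,536 code points
bits = math.log2(code_points)              # the '20.1 bit' code space


def syllable(l: int, v: int, t: int) -> str:
    # Arithmetic composition of a precomposed syllable from zero-based
    # indices (t == 0 means no trailing consonant).
    return chr(0xAC00 + (l * VOWELS + v) * TRAILING + t)


print(precomposed, code_points, round(bits, 1))
print(1_500_000 > code_points)
```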

  Jungshik Shin


RE: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin

On Wed, 4 Dec 2002, Maiorana, Jason wrote:

> Normalization for D has some serious drawbacks: if you were to try
> to implement, say vietnamese using only composing characters,
> it would look horrible. The appearance, position, shape, and size
> of the combining accents depends on which letter they are being
> combined with, as well as which other diacritics are being combined
> with that same letter.

  What's your point here? NFD or NFC, they should be rendered
identically by 'modern' rendering engines.  You're making an assumption
that the way characters are rendered depends on which NF they're
stored/represented in. At least in principle, that should not be the case.
Even a not-so-capable renderer(e.g. xterm with bitmap font or
Linux console) can do an internal normalization to fit its needs
and capability.
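
As a small illustration in Python (assuming the standard unicodedata module;
the sample word is chosen here), the same Vietnamese text in NFC and NFD
differs only in how it is stored, and either form can be converted to the
other before it is handed to whatever the renderer prefers:

```python
import unicodedata

composed = "Vi\u1ec7t"                              # 'e' with circumflex and dot below, one code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + combining dot below + combining circumflex

print(len(composed), len(decomposed))
# A renderer limited to precomposed glyphs can normalize internally:
print(unicodedata.normalize("NFC", decomposed) == composed)
```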

> NF-C is most appropriate for some scripts, and NF-D may be desirable
> for others. It would be better,

  What are your criteria? Again, rendering? As I wrote above,
that has nothing to do with NFs used.

> IMO, if unicode would get rid
> of both forms, and simply support one representation of each
> possible glyph. (No combining characters unless they are the ONLY

  'glyphs'? Coded character set is not about glyphs but about
characters.

> way to represent a particular glyph) (Actually, no combining chars
> at all would be best, because its simplest. Why not just assign
> more code space to the langs that need it?)

 Do you want to give 1.5 million (and more) code points to Korean script?
Why don't you propose your idea to UTC and ISO/IEC JTC1/SC2/WG2?
Either your mailbox will be bombarded with a lot of emails
or you will be greeted with 'dead silence'.

> If you have a filesystem that forces NF-D, then I would say its a
> poorly designed filesystem that makes such choices, because its
> way too low level to care about things like that. Filenames should
> be "string of bytes", and the UI-conventions should allow one
> to distinguish. If you are on a NF-C==canonical system, and you
> mount such a filesystem, you should see bakemoji, and not
> any translated normalization form.

  Why bakemoji? No matter which NF is used in filenames, they should
simply be rendered the way any Unicode-compliant rendering engine
would render them.  This behavior is more consistent with your view
that filenames are strings of bytes than showing 'bakemoji' is.

  Jungshik Shin


RE: filename and normalization

2002-12-04 Thread Jungshik Shin



On Wed, 4 Dec 2002, Maiorana, Jason wrote:

> As a side-note, I copy/pasted a command line flag from a RH8.0
> manpage back into the console, and tried to execute the command.
> It failed, and gave me usage. The reason, I discovered, is that
> the manpage was not using a regular ascii '-', but instead one
> of the HYPHEN, or EM_DASH things (Which is why I HATE them).

  I discovered that a long time ago and gave up copy'n'pasting from
man pages.  I began to write that those characters should not be used in
man pages, but then I came up with a couple of arguments against my own
position and didn't send a message here. One of them was that you can
configure the way your 'man' works in man.config.  You can set NROFF to
use '-Tascii -man' and you get an 'ASCII approximation' of the real em
dash, hyphen, etc., so that you can copy and paste and search
backward/forward for command line options. Another was that man pages are
not only for screen viewing but also for printing. When printed out,
genuine hyphens and em dashes certainly look better than their ASCII
approximations.

  Jungshik


Re: filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin

On 4 Dec 2002, H. Peter Anvin wrote:

> By author:    Jungshik Shin <[EMAIL PROTECTED]>

> >   All right. That's what the *current* SUS/POSIX says. However, that
> > is hardly a solace to a user who'd be puzzled that two visually
> > identical and canonically equivalent filenames are treated differently.

> There *is* no way to solve this problem.  You have the same kind of
> problem with U+0041 LATIN CAPITAL LETTER A versus U+0391 GREEK CAPITAL
> LETTER ALPHA.  However, if you attempt normalizations you *will*

  U+0041, U+0391, and U+0410 are NOT equivalent in any Unicode normalization
form. They're not even equivalent in NFK*.  Note that I didn't
just say visually (almost) identical but also qualified it
with 'canonically equivalent'.

> introduce security holes in the system (as have been amply shown by
> Windows, even though *its* normalizations are even much simpler.)

  Therefore, your example cannot be used to show that there's a security
hole (unless you're talking about applying a normalization not specified
in Unicode), although it can be used to demonstrate that even after
normalization there could still be user confusion, because there are some
visually (almost) identical characters that would be treated differently.

  A better example for your case would be U+00C5 (Latin capital
letter A with ring above) and U+212B (Angstrom sign), or U+004B and
U+212A (Kelvin sign). They're canonically equivalent.
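
  The distinction drawn above can be checked with a small script. This is
only a sketch using Python's standard unicodedata module; the helper names
canonically_equivalent and compat_equivalent are made up for illustration
and are not part of any library.

```python
import unicodedata

def canonically_equivalent(a, b):
    # Two strings are canonically equivalent when their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compat_equivalent(a, b):
    # The looser NFK* comparison mentioned above.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# Canonically equivalent pairs: normalization folds them together.
print(canonically_equivalent("\u00c5", "\u212b"))  # A with ring vs. Angstrom sign
print(canonically_equivalent("\u004b", "\u212a"))  # K vs. Kelvin sign

# Look-alikes from different scripts: no normalization form unifies them.
print(canonically_equivalent("\u0041", "\u0391"))  # Latin A vs. Greek Alpha
print(compat_equivalent("\u0041", "\u0410"))       # Latin A vs. Cyrillic A
```

The first two comparisons should come out true and the last two false, which
is the point of the paragraph above: Unicode normalization only merges
characters that the standard itself defines as equivalent.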

> available to the user (ls -b or somesuch.)  Attempting
> canonicalization is doomed to failure, if nothing else when the next
> version of Unicode comes out, and you already have files that are
> encoded with a different set of normalizations.  Now your files cannot
> be accessed!  Oops!

  I might agree that normalization is not necessarily a good thing.
However, the reason you cite is not so solid. Unicode normalization forms are
**permanently frozen** for existing characters. And UTC and JTC1/SC2/WG2
have committed themselves not to encode any more precomposed characters that
can be represented with existing base characters and combining characters. If
you're not sure of their commitment, perhaps using NFD is safer than
using NFC. Hmm... that may be one of the reasons why Apple chose NFD in Mac
OS X.

  BTW, without changing anything in the Unix APIs and the Unix filesystem
(such changes are not desirable anyway), shells 'might' be a good place to
'add' some normalization (via a user-configurable option at the time
of invocation and with env. variables).
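
  As a rough illustration of what such a shell-side layer might do, here is a
sketch in Python. The function match_entries and its 'form' parameter are
hypothetical; no existing shell is claimed to work this way. The filesystem
is left untouched and only the lookup compares normalized names.

```python
import unicodedata

def match_entries(typed, entries, form="NFC"):
    """Return the directory entries canonically equivalent to `typed`.

    `entries` would normally come from os.listdir(); it is passed in here
    so that the sketch does not depend on any particular filesystem.
    """
    wanted = unicodedata.normalize(form, typed)
    return [e for e in entries if unicodedata.normalize(form, e) == wanted]

# The name on disk uses 'O' + COMBINING DIAERESIS, while the user types
# the precomposed U+00D6.
entries = ["O\u0308.txt", "notes.txt"]
print(match_entries("\u00d6.txt", entries))
```

The stored byte sequence is returned unchanged, so the underlying system
calls still see filenames as plain byte strings.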

  Jungshik


filename and normalization (was gcc identifiers)

2002-12-04 Thread Jungshik Shin

On 3 Dec 2002, H. Peter Anvin wrote:

> By author:    Jungshik Shin <[EMAIL PROTECTED]>

> >  The same is true here. Although Unix file system has few
> > restrictions on file/dir names, it needs to have a provision to specify
> > how to deal with multiple representations of equivalent characters. Is
> > there anything mentioned about this in SUS?
>
> Yes.  Filenames are byte sequences, period, full stop.  Any attempt at
> normalization would violate SUS/POSIX.

  All right. That's what the *current* SUS/POSIX says. However, that
is hardly a solace to a user who'd be puzzled that two visually
identical and canonically equivalent filenames are treated differently.
For instance, U+00D6 (Latin Capital Letter O with diaeresis) should look
identical to, and be treated identically with, U+004F followed by U+0308.
That's what users expect.  I don't know what the best way to resolve
this conflict is. It may be time to seriously reconsider this particular
aspect of SUS/POSIX.  I'm wondering how Mac OS X (well, it's not 100%
SUS/POSIX compliant, but nonetheless it's Unix) works in this area. It
uses NFD. That is, 'U+00D6' is stored as 'U+004F U+0308' and both are
treated identically.
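
  To make the conflict concrete, the following Python sketch shows the two
spellings as the byte strings a byte-oriented filesystem would see. The
'.txt' suffix and the helper nfd_key are illustrative assumptions only.

```python
import unicodedata

precomposed = "\u00d6"   # LATIN CAPITAL LETTER O WITH DIAERESIS
decomposed = "O\u0308"   # 'O' followed by COMBINING DIAERESIS

# As UTF-8 filenames these are simply different byte sequences.
name_a = (precomposed + ".txt").encode("utf-8")
name_b = (decomposed + ".txt").encode("utf-8")
print(name_a)            # b'\xc3\x96.txt'
print(name_b)            # b'O\xcc\x88.txt'
print(name_a == name_b)  # False

def nfd_key(name):
    # Map a UTF-8 filename to its NFD form, as an NFD-storing system might.
    text = name.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")

print(nfd_key(name_a) == nfd_key(name_b))  # True
```

A system that treats filenames purely as bytes sees two distinct files; one
that stores names in NFD sees a single name.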

  Jungshik


Re: gcc identifiers

2002-12-04 Thread Jungshik Shin

On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote:

> On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote:

> > Maybe a --normalize-utf option to the linker might be a good idea, but
> > it should be an option, IMO.
>
> First of all, the standard does not refer to Unicode, but to 10646.
> And the C standard does not use Unicode normalization.
> There is a list in the ISO C standard of 10646 characters that are
> allowed in identifiers, and these do not have alternate representations.

  Thank you for the note.

  I found the FCD of ISO/IEC 9899:1999 (N2794 at
http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug.
1998.  In Annex I, 'Universal Character names for identifiers' (page
487; if you use Acroread to view the PDF version, it's page 499), the set
of allowed characters is listed. (A more or less identical list is found at
http://std.dkuug.dk/TC1/SC22/WG20/docs/standards#10176.) Basically, ISO C99
seems to avoid problems arising from multiple representations by
allowing only precomposed characters in identifiers (is there any change in
this regard in the finally approved ISO/IEC 9899:1999?). Keld's statement
that they do not have alternate representations is not right.
If that were the case, characters like 'Latin Small Letter with Macron'
or 'Hangul Syllable Gga', for which there are alternate representations,
should not be present in the list, but they are listed as allowed.
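
  The alternate representations are easy to display. In the Python sketch
below, U+0101 (Latin small letter a with macron) is picked as one example of
a letter with macron, which is an assumption on top of the text above, and
U+AE4C is Hangul syllable Gga. The helper decompose is hypothetical.

```python
import unicodedata

def decompose(ch):
    # List the code points of the canonical decomposition (NFD) of `ch`.
    nfd = unicodedata.normalize("NFD", ch)
    return ["U+%04X" % ord(c) for c in nfd]

for ch in ("\u0101", "\uae4c"):
    print(unicodedata.name(ch), decompose(ch))
```

Both characters decompose into a sequence of two code points (a base letter
plus a combining macron, and a leading consonant plus a vowel jamo), so each
has at least two canonically equivalent spellings.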

  What ISO C99 seems to do is to shift the burden of normalization from
compilers and linkers to editors or whatever tools programmers use to
edit source files.  That's fine (editors can do that) and is perhaps
a wise decision (preventing potential troubles from propagating through
a compiler-linker chain at the earliest stage by issuing an error and
stopping compilation), but there's a little trouble with allowing only
precomposed characters. Neither ISO/IEC JTC1/SC2/WG2 nor UTC will encode
any more precomposed characters which can be represented with existing
base characters followed by one or more combining characters. However,
'combining diacritical marks' (e.g. \u0300 - \u0362) are not allowed in
identifiers, so 'any character' that's not encoded in precomposed
form can't be used in identifiers. Some people would resent not being able
to use 'their characters' in identifiers and may use it to make a case for
encoding precomposed forms of theirs in ISO 10646.  How about references
to filenames (as in an '#include' directive) with combining diacritical
marks that are parts of characters NOT encoded in precomposed form?
Aha, they can use '\u' (or '\U')...

  Jungshik


Re: Why doesn't Linux display Japanese file names encoded in UTF-8?

2002-12-03 Thread Jungshik Shin

On Wed, 4 Dec 2002, Jim Z wrote:

Jim,

This time, I hope my answer will solve your problem :-)

> >From: Jungshik Shin <[EMAIL PROTECTED]>
> >On Tue, 3 Dec 2002, Jim Z wrote:

> >   You can easily add 'Japanese (UTF-8)' to your gdm/kdm language
> >selection menu. See
> ><https://bugzilla.mozilla.org/bugzilla/show_bug.cgi?id=75829>
> I couldn't get into here and is it a typo? PLEASE help - I really want to

   I'm sorry, it's <https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829>

> > > I did a 'showmount -e 10.xxx.xxx.xxx' but I got scrambled Japanese
> > > characters for those entries that are encoded in UTF-8. Then I switched
> >the
> > > locale to ja_JP.UTF-8, but the same stuff was returned. What's wrong
> >with
> > > this picture?

> It's an UNIX (Linux) to UNIX (NetBSD) mount. The UTF-8 Japanese file names
> are in my NetBSD:/etc/exports. I can only mount those entries that are ASCII
> equivalent. I also tried it from Solaris 8 (logged in as 'Japanese UTF-8
> (Unicode)') and it worked fine. I am sure if I can turn on UTF8 mode I
> should be able to do so.

  NFS should be encoding-neutral just like the rest of the Unix FS
is (except for cases like exporting to and from non-Unix systems where
different file systems are used). Why don't you begin with a simpler
case? Before using UTF-8 for directory names to export via NFS, you can
begin by making sure UTF-8 filenames under an NFS-exported directory
come out all right on the client side.  BTW, I've just experimented
with UTF-8 directory names in the export list (/etc/exports), and it worked
fine between Mandrake 9.0 (server) and RedHat 8.0 (client). Judging from this
and the fact that Solaris and NetBSD worked fine, it should also work
between NetBSD and RH 7.3.

> >   Needless to say, you have to run your shell in UTF-8 terminal
> >(e.g. xterm 16x or mlterm) to view UTF-8 characters.
> >
> I can't get it to work. 'xterm -u8' doesn't work. the locale never changes.
> From Solaris you can do a "LANG=ja_JP.UTF-8 dtterm &" and the new dtterm has

   You have to do the same for xterm as you do for
dtterm: 'LANG=ja_JP.UTF-8 xterm'. The '-u8' option is not necessary for recent
xterm. Or you can do it in the opposite order. That is, run 'xterm -u8'
and then set LANG to ja_JP.UTF-8 in xterm (UTF-8). Actually, you have to
do it the latter way if your /etc/sysconfig/i18n or ~/.i18n sets $LANG to
a value other than ja_JP.UTF-8, because the shell initialization script in
RedHat *overrides* the value set before the shell invocation with the value
in /etc/sysconfig/i18n or ~/.i18n (see /etc/profile.d/lang.(sh|csh)).

> what is mlterm? Couldn't find it on Linux 7.3.

  I'm not sure if it's in RH 7.3. You can get it at
http://mlterm.sourceforge.net

  Jungshik


Re: readdir() on linux

2002-12-03 Thread Jungshik Shin

On Tue, 3 Dec 2002, marco wrote:

> Ok, does anybody know if the same applies to other unices (e.g.:
> AIX/Solaris)?
> I would like to understand how Linux compare to these commercial OS's.

  In a sense, it can be argued that Linux is more compliant with the
Single Unix Specification than (some) commercial Unixes. The Unix filesystem
has never had any internal information about the 'encoding/charset' of
filenames; all it knows is that they're null-terminated sequences of octets.
When we all move to UTF-8, it won't matter. Until then, you have
to rely on external information.
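
  A short Python sketch of why that external information is needed: the same
octets give different text depending on the charset assumed by whoever reads
the directory. The sample name and the helper decode_name are made up for
illustration.

```python
def decode_name(raw, encoding):
    # Interpret the octets of a filename under an assumed charset;
    # undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(encoding, errors="replace")

# Octets as readdir() might return them for a name created in an EUC-JP locale.
raw = "\u65e5\u672c\u8a9e".encode("euc_jp")

print(decode_name(raw, "euc_jp"))  # the intended three characters
print(decode_name(raw, "utf-8"))   # garbage: the bytes are not valid UTF-8
```

Nothing in the directory entry itself says which interpretation is right;
that has to come from the locale or some other convention.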

 Jungshik

