Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1

2004-03-08 Thread Tomohiro KUBOTA
Hi,

I received the following mail personally.  The writer permitted
me to cite it on linux-utf8.


From: [EMAIL PROTECTED]
Subject: Re: relevance of "[PATCH] tty utf8 mode" in linux-kernel 2.6.4-rc1
Date: Tue, 02 Mar 2004 09:34:37 -0500

 (I can't post to this list right now, it's refusing my ISP's email relay
 so I'm writing to you directly)

 Tomohiro KUBOTA wrote:

 Why do you think Kanji support is somewhat "fanciful" while the real
 Linux kernel has been supporting Latin/Cyrillic/Arabic/Greek and UTF-8?
 Is it because east Asian people are less important than European people?


 This is a good point; however, it may be impractical to load
 full-featured Unicode settings and options, an input method, and a
 conversion engine very early in the kernel bootstrap process.
 Even if it were added to the kernel, the resulting size might still be
 too large to get meaningful support into LILO or GRUB, for example.

 A compromise might be to use half-width katakana for kernel startup
 messages. English has accepted a considerable amount of change from the
 world of typewriters and computers, such that the language has been
 adapted to accommodate them as much as they have adapted to it. For
 very small embedded systems and kernel bootstrap routines, half-width
 katakana or a similar language compromise is more practical in my
 opinion.

 Once the full, general-purpose operating system has been loaded, a
 proper, full-featured language interface would of course become
 available.

 I think this is a reasonable compromise: a user who was not interested
 in the guts of the operating system would never see this stuff anyway;
 instead they would be presented with a nice shiny graphic while the
 system started up.


 ヨロシク,

In my opinion, i18n support in the Linux console is important primarily
for reading translated messages from various administration commands.

In the Japanese case, translated messages are written in normal Japanese
(a mixture of Hiragana and Kanji, with Katakana for transliteration of
words from foreign languages), not in Katakana alone.  It is not easy to
transliterate normal Hiragana-Kanji Japanese text into Katakana text.
(It needs a dictionary of the whole Japanese vocabulary, which is
apparently much larger than a set of Japanese fonts.)

To read Japanese translated messages, support for Hiragana, Katakana,
and Kanji (CJK Ideographs) is needed.  What can be discussed as a
compromise is the range of CJK Ideographs to be supported.  In the case
of Japanese, JIS X 0208 (fewer than 7000 characters) would be a moderate
choice.  The JIS X 0212 set (also fewer than 7000 characters) is likewise
included in the "CJK Unified Ideographs" block (U+4E00 - U+9FAF), but it
could be optional for the Linux console.

It may be feasible to limit Japanese *input* support on the Linux
console to Hiragana or Katakana, because a Japanese input system needs
a dictionary of the whole Japanese vocabulary and a grammatical
analysis system.  (In the future, when such a large amount of data is
relatively "small" compared to average disk/network capacity, there
might be a real need to support Japanese input.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/

Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1

2004-03-02 Thread Tomohiro KUBOTA
Hi,

From: Bruno Haible [EMAIL PROTECTED]
Subject: Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1
Date: Tue, 2 Mar 2004 15:17:32 +0100

 No, you don't need a cursor at the middle position of a doublewidth character.
 
 There are two use-cases of terminals:

...

   a) The applications which assume a line-oriented display and don't care
  about the line width. For these a line-oriented (or paragraph-
  oriented) terminal model is suitable. This terminal can decide about
  character widths on its own, do bidi and ligatures, possibly use
  proportional fonts.
 
  In this case there is no use for | for line drawing, or for block
  graphics.

Right.

   b) The applications which assume a cell matrix. Examples: vim 6,
  GNU readline, X/Open curses. These applications know what is
  represented on the screen, and where, because they keep their own
  cell matrix.
 
  When such an application wants to put a | at position (x, y), it
  can do
 
(gotoxy x-1 y) space space backspace |
  or
(gotoxy x-1 y) space space (gotoxy x y) |
 
  instead of the simplistic
 
(gotoxy x y) |
 
  that you propose.

Software has to be implemented that way.  Otherwise, it fails.
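
Bruno's two escape-sequence recipes above can be sketched concretely.
This is a minimal illustration assuming a VT100/ANSI-style terminal
(CUP sequence, 1-based coordinates); the helper names are mine, not
from any real curses API:

```python
def gotoxy(x: int, y: int) -> str:
    """CUP: move the cursor to column x, row y (both 1-based)."""
    return f"\x1b[{y};{x}H"

def draw_bar_backspace(x: int, y: int) -> str:
    # (gotoxy x-1 y) space space backspace |
    # The two spaces erase any doublewidth glyph straddling columns
    # x-1 and x; backspace steps back one cell so "|" lands at x.
    return gotoxy(x - 1, y) + "  " + "\b" + "|"

def draw_bar_reposition(x: int, y: int) -> str:
    # (gotoxy x-1 y) space space (gotoxy x y) |
    return gotoxy(x - 1, y) + "  " + gotoxy(x, y) + "|"

print(repr(draw_bar_backspace(5, 3)))   # '\x1b[3;4H  \x08|'
print(repr(draw_bar_reposition(5, 3)))  # '\x1b[3;4H  \x1b[3;5H|'
```

Either sequence is safe even when cell (x, y) currently holds the right
half of a doublewidth character, which is exactly Bruno's point.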


  If you are thinking about the far future, please think about a completely
  different system, instead of modifying the existing tty system.
 
 No, the tty system has to be modified where needed.

When it is modified, compatibility must be preserved.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: relevance of [PATCH] tty utf8 mode in linux-kernel 2.6.4-rc1

2004-03-01 Thread Tomohiro KUBOTA
Hi,

You always make light of compatibility with non-European-language
environments.  Even if it was not an ideal choice originally, standards
need to maintain compatibility with popular past environments.

The kernel will have to handle wcwidth() anyway:
  - to display doublewidth characters on the console
  - to calculate the cursor position on the console after processing
    a 0x08 (in your case; if a 0x08 always moves one *cell*, the
    calculation does not need wcwidth())
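
The column bookkeeping described above can be approximated in
userspace.  Here is a rough sketch of wcwidth()-style counting, using
Python's unicodedata as a stand-in for the C library's
locale-dependent wcwidth():

```python
import unicodedata

def cell_width(ch: str) -> int:
    """Rough wcwidth(): 2 for East Asian Wide/Fullwidth characters,
    0 for combining marks, 1 otherwise.  (The real wcwidth() is
    locale-dependent and treats control characters specially.)"""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def columns(s: str) -> int:
    """Display columns occupied by s -- what a console needs to know
    to place the cursor after writing the string."""
    return sum(cell_width(ch) for ch in s)

print(columns("abc"))            # 3
print(columns("日本語"))          # 6: each kanji occupies two cells
print(columns("Kubota 久保田"))   # 7 + 6 = 13
```

This also shows why a 0x08 that moves one *character* (rather than one
cell) forces the kernel to remember per-character widths.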

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/




Re: Linux console internationalization

2003-08-14 Thread Tomohiro KUBOTA
Hi,

From: Innocenti Maresin [EMAIL PROTECTED]
Subject: Re: Linux console internationalization
Date: Wed, 06 Aug 2003 03:22:27 +0400

 Tomohiro KUBOTA wrote:
 
  Interesting, but any plan to support more than 512 characters?
 
 Not within VGA text modes.
 2^9 is a hardware restriction based on text framebuffer's data semantic.

I see.  It was with MS/PC-DOS version 6.x (so-called DOS/V) that
IBM-compatible PCs became able to display Japanese characters on the
text screen.  (Before that, local Japanese PCs with hardware Japanese
support were used.)

I imagine that MS/PC-DOS used the VGA graphics mode.  (I heard that the
"V" in the name DOS/V came from VGA.)

 And I think that 9x16 (this is the largest glyph size usable in VGA text)
 is apparently much smaller than is needed to read Japanese glyphs without
 eye strain.  Even for a 12-year-old Japanese person ;-)

Right.  On a tty, Japanese characters are displayed using two columns.
For example, when ASCII characters are 8x16, Japanese characters are
16x16.

 So, VGA text seems not to be an acceptable solution for East Asia.

Right.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: Linux console internationalization

2003-08-05 Thread Tomohiro KUBOTA
Hi,

From: Innocenti Maresin [EMAIL PROTECTED]
Subject: Linux console internationalization
Date: Wed, 06 Aug 2003 02:25:32 +0400

 P.S. I have just made a Web page describing my view of Linux console i18n
 and further plans.
 There is also a glossary of the terms used.
 http://www.comtv.ru/~av95/linux/console/

Interesting, but is there any plan to support more than 512 characters?
512 is apparently far fewer than East Asian people need.
(For example, the Japanese basic character set (JIS X 0208) has
several thousand characters.  A 12-year-old Japanese person
should know roughly one thousand characters, and adults should
know many more.)
And how about fullwidth characters (i.e., those for which wcwidth()
returns 2) and combining characters (wcwidth() returns 0), which
xterm supports?

I am looking forward to the linuxconsole project:
http://linuxconsole.sourceforge.net/
Do you know about the project?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: gtk2

2003-04-02 Thread Tomohiro KUBOTA
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: gtk2
Date: Tue, 1 Apr 2003 22:02:36 -0500

 gnome-terminal and multi-gnome-terminal are fairly lightweight.
 Also, the user interface used to configure, interact with, and
 use the input method has to use some toolkit. I'd say gtk2 is
 as good a choice as any other.

As Glenn wrote, gnome-terminal is not very lightweight.

And are you saying that non-European-language-speaking people don't
need to have choices?  For example, there are people who like Eterm,
Aterm, Wterm, Rxvt, Xterm, and so on.  (Note that all of them support
XIM.)  Is it a privilege of European-language-speaking people to
express such preferences?  This is what I wanted to call ethnocentrism.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





alias in fontconfig (Re: supporting XIM)

2003-03-31 Thread Tomohiro KUBOTA
Hi,

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Mon, 31 Mar 2003 11:08:53 +0900


  - a word processor whose menus and messages are in English but can
    input/display/print text in your native language
  Which is better?  The first one is completely unusable and the second
  one is inconvenient but usable.

 I agree with you on this point.  That's why I compared the status of
 KDE in 1999-2000 with that in 2003.  Back in 1999-2000, KDE/Qt people
 thought that translating messages is I18N, but they no longer do, and
 KDE/Qt supports 'genuine I18N' much better now.

I am glad there are people who understand this point.  Several years
ago, even when I said this dozens of times, I was ignored.


  - Xmms cannot display non-8-bit languages (music titles and so on).

 Are you sure? It CAN display Chinese/Japanese/Korean ID3 v1 tags
 as long as the codeset of the current locale is the codeset used in
 the ID3 v1 tag.

I'll test this further.  However, please note I won't be satisfied by
i18n which requires specific configuration other than setting the LANG
variable (and installing the required software and resources).


  - Xft/Xft2-based software cannot display Japanese and Korean at the
    same time, even though Xft and Xft2 are UTF-8-based, because there
    are no fonts which contain both Japanese and Korean.  This should
    not be regarded as a font-side problem, because (1) font-style
    principles differ among scripts (there is no Courier font for
    Japanese)

 You can use an 'alias' in fontconfig if some programs use 'Courier'
 or 'Arial' instead of generic font names like 'monospace', 'serif',
 'sansserif', and so forth.

I want such aliases to be automated.  If I have one Korean font
installed, it is obvious that the renderer must use that font for all
Korean text.  It is not a good idea for the renderer to fail to display
Korean when the user hasn't configured the alias.

Since typography differs among scripts (Latin, Cyrillic, Greek,
Han, Hangul, Hiragana, Katakana, Arabic, Hebrew, Thai, ...), we cannot
expect there to be many fonts which cover many scripts at once
(except for a few basic fonts like 'misc' or 'sansserif').
I cannot imagine a Courier Hiragana font or a Mincho Arabic font.
This is why the alias mechanism is not a makeshift but a naturally
needed mechanism.
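
For reference, the fontconfig alias Jungshik mentions is configured in
XML.  A hypothetical fonts.conf fragment (the font names here are
examples only) that lets requests for 'Courier' fall back to a Korean
font might look like:

```xml
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<!-- Example fragment: when an application asks for "Courier",
     prefer Courier itself, then fall back to a Korean font
     (font names below are illustrative). -->
<fontconfig>
  <alias>
    <family>Courier</family>
    <prefer>
      <family>Courier</family>
      <family>Baekmuk Dotum</family>
    </prefer>
  </alias>
</fontconfig>
```

Kubota's point is that this mapping should be derived automatically
from the installed fonts rather than hand-written by each user.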


  - There is no lightweight web browser like dillo which is i18n-ed.

 I think that w3m-m17n is an excellent lightweight browser that
 supports I18N well.

Well, I meant a lightweight GUI browser.  Though I haven't checked,
I imagine dillo and similar browsers use the 8-bit font mechanism.

There is another i18n extension of w3m: w3mmee.  I don't know which
is better.


  - The FreeType mode of XFree86 xterm doesn't support doublewidth
    characters.

 Well, it sort of does. Anyway, I submitted a patch to Thomas and I expect
 he'll apply it sooner or later. After that, I'll add a '-faw' option
 (similar to the '-fw' option).

Fantastic!  May I ask for more?  Xterm can automatically search for a
suitable (corresponding) doublewidth font in non-FreeType mode.  How
about with your patch?


  I already mentioned this issue. Programs like 'fmt' have to be
 modified, but there's already an alternative to 'fmt' that supports
 the Unicode line-breaking algorithm.

When I wrote this sentence, I was thinking about Text::Wrap in Perl.


---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-31 Thread Tomohiro KUBOTA
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: gtk2 + japanese; gnome2 and keyboard layouts
Date: Mon, 31 Mar 2003 01:44:15 -0500

 Well I for one have been placated for now by im-ja. It's precisely
 what I've been looking for, and extensive googling didn't root it out.

I also tested the im-ja Debian package with gnome-terminal.
I felt it will surely be a convenient tool after more development.

There are some points.

- Japanese input methods need user preference configuration.
  For example, some (not a small fraction of) Japanese people want
  Ctrl+U to be Hiragana conversion, Ctrl+I to be Katakana
  conversion, Ctrl+O to be Hankaku conversion, and Ctrl+P to be
  Alphabet conversion (reverse romaji/kana conversion), without
  Kakutei (confirmation).  These key bindings come from the popular
  commercial input engine ATOK.  (It was my first input engine,
  which I used for about 8 years starting 15 years ago.  After that,
  I configured all input methods (other than SKK) to an ATOK-like
  key binding.)

- Japanese input methods have a key sequence to switch between
  no-conversion and kanji conversion.  In im-ja, Shift+Space or the
  Henkan key (available on Japanese keyboards) cycles through
  no-conversion - Hiragana - Katakana - Canna - Kanjipad -
  no-conversion.  This is not suitable for Japanese people who want
  to input a large amount of Japanese text as a mother tongue (or
  first language).  Usually, such omnibus switching (Hiragana -
  Katakana - Kanji - Kanjipad - JIS table - ...) is bound to the F10
  key.  I think it should be configurable, too.
  (I don't know why (from what analogy) the Henkan key was originally
  used for this purpose.)

- Canna mode seems not to show some important information, such
  as conversion boundaries (Bunsetsu boundaries) and the Bunsetsu
  currently being converted.

- Canna mode seems not to supply various conversion keys
  (for example, making the conversion boundary larger/smaller,
  Hiragana conversion, Katakana conversion, and so on).  I may be
  wrong because I have not tested it very thoroughly.  (How about
  dictionary handling, the JIS character table, and so on?)

Does the GTK+2 input method framework supply ways for input methods
to provide configurators?

Are there any Japanese members among the im-ja developers?  Japanese
people know many tiny but important points for achieving a convenient
input method and user interface.

Anyway, I imagine most Japanese people will continue to use XIM
for a while because (1) changing input method is like changing
keyboard from QWERTY to DVORAK, (2) GTK+2 input methods are not
supported by popular software (you can imagine it is confusing
to use multiple input methods with different user interfaces;
it is like using QWERTY for one program and DVORAK for another),
and (3) a conversion dictionary which the user has taught many words,
along with the conversion order of homonyms, is a valuable thing, and
changing input method may mean losing that data.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: gtk2 + japanese; gnome2 and keyboard layouts

2003-03-31 Thread Tomohiro KUBOTA
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: gtk2 + japanese; gnome2 and keyboard layouts
Date: Mon, 31 Mar 2003 21:27:29 -0500

 Yes, that is a good point, but it brings up a question:
 how is this going to interact with applications which already
 have meanings for CTRL+O (File Open), CTRL+P (Print), etc

Key bindings of Japanese input methods fall into (at least) three
categories:
 - keys which must be available at all times
 - keys which must be available only when the input method is active
 - keys which must be available only when there is an undetermined
   string

My examples of CTRL+O and CTRL+P are in the third category, because
they convert the current undetermined string into Hiragana, Katakana,
and so on.  In other cases, the input method can pass these key
sequences through to the application.

Only the first category of keys is fatal for collisions.  However,
it contains only one key: input method activation (like Shift+Space
or Henkan in im-ja).

Keys like mode changes among Hiragana/Katakana/Kanjipad are in the
second category in ordinary input methods, though im-ja assigns
Shift+Space or Henkan (the same as input method activation) to this
function.


 As a primary input method for a native speaker: I think it
 needs perhaps a bit more work, and of course evolution, mozilla,
 vim, etc, have to complete their transitions to gtk2.

To become popular among native (Japanese) speakers, popular software
must support GTK2 input methods: for example, mule/emacs/xemacs,
kterm, rxvt, xterm, and KDE software.  In particular, mule/emacs/xemacs
is overwhelmingly popular among Japanese users because it has been the
only way to write Japanese in both X and non-X environments for
decades.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-30 Thread Tomohiro KUBOTA
   whitespace between words.


I feel that CJK people constantly have to keep watch on software
which is already i18n-ed, because the i18n support of such software
sometimes breaks when new versions are released.  (Xedit often
changes its status (can use XIM or cannot use XIM).  What is
happening?)  This is fatal when a translation is already supplied
(as in the OpenOffice.org case).  I think a certain part of CJK
developers' time is wasted on this.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Pango tutorial? (Re: supporting XIM)

2003-03-30 Thread Tomohiro KUBOTA
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Sun, 30 Mar 2003 19:25:41 -0500

 If the theme engine uses pango for layout, and a desired language
 context is understood, I think this would work fine. Pango can always
 substitute fonts for missing glyphs...

Unfortunately, there are no tutorials for Pango.  A developer of Xplanet
and I sent mails to Pango developers (Evan Martin and Noah Levitt) to
ask about that, but they think Pango is not intended to be used from
applications directly, only from an upper toolkit layer.

However, GTK2 is too heavy to be recommended for *all* software which
displays some text.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-30 Thread Tomohiro KUBOTA
Hi,

From: H. Peter Anvin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: 30 Mar 2003 17:02:58 -0800

 Perhaps not double-width, but there are plenty of non-ASCII,
 non-ISO-8859-1 characters in the Unicode set that should be
 interesting to U.S. programmers.

This is good information.  I am afraid, though, that such people will
hard-code UTF-8 support for only up to two bytes.  Though I haven't
found such software myself, I have heard it exists.  We have to
continue keeping watch on the i18n implementations of software.

How about em-dashes, or ligatures such as fi or ffl?  Are they
doublewidth?
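
For what it's worth, Unicode's East Asian Width property answers this:
EM DASH is classified "Ambiguous" (single-width in Western contexts,
doublewidth in some legacy CJK environments), and the fi/ffl ligatures
are not Wide or Fullwidth.  A quick check in Python:

```python
import unicodedata

# East Asian Width codes: W = Wide, F = Fullwidth, A = Ambiguous,
# H = Halfwidth, Na = Narrow, N = Neutral.
for ch, name in [("\u2014", "EM DASH"),
                 ("\ufb01", "LATIN SMALL LIGATURE FI"),
                 ("\ufb04", "LATIN SMALL LIGATURE FFL")]:
    print(name, unicodedata.east_asian_width(ch))
```

Implementations typically render Ambiguous characters single-width in
non-CJK locales and sometimes double-width in CJK legacy locales.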

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-30 Thread Tomohiro KUBOTA
Hi,

From: srintuar26 [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Sun, 30 Mar 2003 19:25:41 -0500

  - Tcl/Tk's XIM support is unstable even now.  (Every time I try to
    input Japanese, it hangs.)  When I read Tcl/Tk's roadmap back in
    the version 8.0 era, I was really surprised that XIM support
    (essential for CJK, as you know) was a very low priority.

 eh, XIM needs to be dropped imo. From personal observation, building
 tools such as XIM and IIIMF which are integrated into the X server is
 the wrong way to go, and GTK+ input methods seem to work much better.

Why wrong?  In any case, CJK people have been waiting for years.  No
more vaporware.  Note that Tcl/Tk-based software which needs text input
is not usable at all because of this problem.


  - Text line wrapping.  Chinese and Japanese (not Korean) don't use
    whitespace between words.

 Ooh, that makes me curious: is there a good discussion of how to
 line-break Japanese text? I wonder how browsers are doing it...

The (non-)use of spaces in Chinese and Japanese causes problems for
text search systems such as mnoGoSearch.  The mnoGoSearch developer
team now seems to be thinking about using ChaSen to analyze Japanese
text (though ChaSen doesn't support Chinese).  Also, I cannot
imagine a Japanese dictionary for ispell.

Line breaks in Japanese can be inserted almost anywhere, except around
a few symbols (like kuten and touten, which are like the period and
comma in English sentences).  Japanese sentences also often contain
Latin letters (for example, there are many companies whose names
are written in Latin letters, like SONY, NEC, and so on) and
whitespace.  Note that an LF in the original Japanese text must not be
regarded as a space (don't insert a space when joining Japanese
lines).

However, Thai is much more difficult.  It doesn't use whitespace
between words either, but line breaks must be placed at word
boundaries.  This means a Thai dictionary is needed to achieve correct
line breaking for Thai.
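
The Japanese rule described above (break almost anywhere, but never
start a line with kuten, touten, and similar symbols) can be sketched
as a toy wrapper.  This counts characters rather than display cells
and handles only a small subset of the prohibited ("kinsoku")
characters; real implementations cover many more:

```python
# Characters that must not begin a line (small illustrative subset:
# kuten, touten, closing quote, closing parenthesis).
KINSOKU_HEAD = "。、」）"

def wrap_ja(text: str, width: int) -> list[str]:
    """Break after `width` characters, but carry prohibited
    characters onto the current line instead of starting a new one
    with them (so a line may run slightly over `width`)."""
    lines, line = [], ""
    for ch in text:
        if len(line) >= width and ch not in KINSOKU_HEAD:
            lines.append(line)
            line = ""
        line += ch
    if line:
        lines.append(line)
    return lines

for ln in wrap_ja("これはテストです。短い文を折り返します、はい。", 6):
    print(ln)
```

No dictionary is needed, which is exactly the contrast with Thai drawn
above: there, breaks are only legal at word boundaries.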

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-29 Thread Tomohiro KUBOTA
Hi,

From: Glenn Maynard [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Fri, 28 Mar 2003 16:49:31 -0500

 Stop using the word racist.  It's like saying if you don't support a
 feature I want, you're supporting terrorism; it makes people groan and
 stop paying attention.  It's inflammatory, doesn't help your case at all,
 and injures your credibility.

I see.  I didn't know the subtle nuance of the word.  (Dictionaries
never teach us about such nuances.)

However, I am often annoyed by people who think supporting European
languages is more important than supporting Asian languages even when
there is no technical obstacle to such support.  They have no racist
ideas.  They just feel non-European languages are somewhat exotic and
that support for such languages is a special feature of software.

To be fair, I should mention that typical Japanese developers and users
don't think about non-Japanese/English language support either.  I don't
think they are racists.  They just forget there are languages other
than Japanese and English.

What should I call such people?  I know they are not racists in the
original meaning of the word.

Note that even if they are not racists, the result (that there is
little internationalized software) is almost the same as if they really
were.  The difference is that I have a little hope of persuading these
developers not to forget about non-European-language speakers.  On the
other hand, real racists are those who explicitly know about non-
European-language speakers and who think they should be discriminated
against.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-28 Thread Tomohiro KUBOTA
Hi,

From: Jungshik Shin [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Thu, 27 Mar 2003 18:38:51 -0500 (EST)

   That's not a problem at all because there are Korean, Japanese
 and Chinese input modules that can coexist with other input
 modules and be switched to and from each other. With them, you
 don't need to use XIM.
...


One point: many Japanese texts include Latin letters, so Japanese
people want to input not only Hiragana, Katakana, Kanji, and numerals
but also Latin letters.  I imagine Korean people do, too.  In such a
case, switching between Alphabet (no-conversion) mode and conversion
mode has to be achievable by a simple key press like Shift+Space.  The
switch must be between conversion mode and no-conversion mode, not
among all installed input methods.  Is this possible in GTK
applications?  (This is achieved in Windows: Alt+Esc switches between
conversion and non-conversion, while Alt+Shift switches among
installed input methods.)
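
The two distinct switches described here (toggling conversion on/off
versus cycling through installed input methods) can be sketched as a
tiny state machine.  Key names and method names are illustrative only,
not any toolkit's real API:

```python
class InputState:
    """Two-level input switching: one key toggles conversion within
    the current input method; another cycles installed methods."""

    def __init__(self, methods):
        self.methods = methods      # installed input methods
        self.index = 0              # currently selected method
        self.converting = False     # conversion vs. no-conversion mode

    def press(self, key: str) -> None:
        if key == "Shift+Space":    # toggle conversion only
            self.converting = not self.converting
        elif key == "Alt+Shift":    # cycle installed methods
            self.index = (self.index + 1) % len(self.methods)

    @property
    def current(self) -> str:
        return self.methods[self.index] if self.converting else "direct"

ime = InputState(["Japanese", "Korean"])
print(ime.current)          # direct
ime.press("Shift+Space")
print(ime.current)          # Japanese
ime.press("Alt+Shift")
print(ime.current)          # Korean
ime.press("Shift+Space")
print(ime.current)          # direct
```

The point of the design is that the frequently used toggle never
disturbs which input method is selected.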

Another point: I want to purge all non-internationalized software.
Today, internationalization (such as Japanese character support) is
regarded as a special feature.  However, I think that the lack of
internationalization support should be regarded as a bug as severe as
racist software.  However, GTK is a relatively heavy toolkit, and
developers who want to write lightweight software won't use it.
I never think that if there is one internationalized program in a
category (for example, gnome-terminal), it is enough.  If developers
want to develop other programs in the same category (xterm, rxvt,
eterm, aterm, ...), it means users have the freedom to choose.  Such
freedom of choice must not be a privilege of English-speaking (or
European-language-speaking) people.  Do you have any idea how to solve
this problem?


  There is at least one Japanese gtk2 input module, as I wrote above.
 You just have to install it, because it doesn't come by default with
 gnome 2.x.

Japanese people need multiple input modules.  This is because Japanese
conversion is too complex for any one program to achieve perfectly.

Since complexity itself sometimes confuses users, there are input
methods which aim to be simple so as not to surprise the user.
(However, such simplicity is achieved by requiring more information
or more keyboard input from the user for conversion.)
People who don't want to keep watching the screen or keyboard while
inputting a sentence (expert users) tend to prefer such simple methods,
with less need to watch the screen to confirm conversion results.

SKK is one such method.  It cannot convert multiple words at a time
(unlike most modern input methods), but this means it never (wrongly)
converts one word into multiple words.  T-Code is a much more spartan
input method, with a one-to-one mapping from a keyboard sequence to a
kanji.  Though a user has to memorize thousands of such mappings,
because the Japanese language needs thousands of kanji, such input
methods are popular among a certain (not large) number of Japanese
people.

Of course, several Japanese companies are competing in the input
method area on Windows.  These companies are researching better input
methods: larger and better-tuned dictionaries with newly coined words
and phrases, better grammatical and semantic analyzers, and so on.
I imagine this is one of the areas where Open Source people cannot
compete with commercial software built by full-time developer teams.

How about Korean?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-26 Thread Tomohiro KUBOTA
Hi,

From: Edward H Trager [EMAIL PROTECTED]
Subject: Re: supporting XIM
Date: Wed, 26 Mar 2003 12:29:30 -0500 (EST)

 I'd also like to be able to see instantaneous, on-the-fly switching of
 language/locale without having to restart KDE or Gnome or the program
 being used.  I want to be able to just hit a button or key combination to
 switch everything from, say, English to French, or Chinese, or Japanese...
 It would be similar to using Yudit where I can easily assign function keys
 for changing the keyboard map/ input method.

Is it possible to implement an XIM server as a wrapper around other XIM
servers and input method engines/libraries?  It would also wrap the
locale, so that UTF-8 software would be able to connect with Canna.

BTW, mlterm (http://mlterm.sourceforge.net) can switch XIM servers
on the fly.  Since it manages the XIM-connection locale independently
from its main locale, it can (for example) use Canna and the like from
the en_US.UTF-8 locale.  I think you can customize mlterm so that you
can switch input methods with function keys or other keys.

However, such an application-side solution is not very good, because
it depends on the application, and most programs would have poor input
method support, because most developers in the world don't know much
about input methods.  I want all software, including lightweight
programs, to be able to input/output not only ASCII or 8-bit
characters but also my mother tongue (Japanese).  I don't want to say,
"Hey, I am lucky!  At last I found a Japanese-capable program!"

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-25 Thread Tomohiro KUBOTA
Hi,

From: Maiorana, Jason [EMAIL PROTECTED]
Subject: RE: supporting XIM
Date: Mon, 24 Mar 2003 11:11:31 -0500

 I think it should be much more stateless, allowing the client
 library to do the romaji/kana conversions, and simply
 having the server answer queries for possible Kanji, of course
 all in UTF-8. The state of the client's interface should be
 kept on the client side, imo.

FYI: Anthy is designed as a library-based input method.
All tasks, including not only romaji/kana conversion but also
kanji conversion, are done in the library.  A GTK+ module and an
XIM module are provided.  (I have not tested Anthy.)
I heard that Anthy stopped providing an IIIMF module because
the developers thought the IIIMF protocol has a security problem,
but I don't know the details.  Hiura-san, do you know something
about this?

Canna and Wnn (now FreeWnn) are designed as client-server systems.
They have their own protocols.  Emacs (tamago), XEmacs (with
Mule-Canna-Wnn), and kinput2 are well-known clients for Canna and Wnn
servers.  As you know, kinput2 is an XIM server for Canna and Wnn.

Thus, if you don't like XIM but don't hesitate to use Canna or
FreeWnn, there might be a way to develop a GTK+ module for Canna and
FreeWnn.  The problem with this solution is that it is valid only for
GTK2-based software: not for basic programs such as xterm, rxvt, and
emacs, not for KDE software, and not for slow computers whose users
don't want to use GTK2.

The problem with IIIMF (as far as I have tested) is that it is not
easily compiled, nor very stable.  Hiura, do you have any plans to
provide easy-to-test .rpm and .deb packages of IIIMF-related software?
That might make users and developers become interested in IIIMF, want
to study it, and want to develop IIIMF software.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/





Re: supporting XIM

2003-03-24 Thread Tomohiro KUBOTA
Hi,

From: Juliusz Chroboczek [EMAIL PROTECTED]
Subject: Re: supporting XIM [was: lamerpad]
Date: 13 Mar 2003 01:27:47 +0100

 The problem with IM support under X11 is that the XIM framework
 doesn't make sense.  It defines an overly complex protocol that
 requires both the client and the XIM server to perform dozens of
 useless activities.  Additionally, it defines four only remotely
 related protocols (``styles''), all of which need to be tested
 against.

I don't know the XIM protocol itself well, but I don't think it is
difficult to implement an XIM client, nor is it more complicated
than necessary.


One or two styles are enough.  In particular, over-the-spot style
is relatively simple to implement and useful for users.

Which points of XIM do you consider useless?  I don't know
why you think so --- whether because you really understand XIM
or because you don't know the complexity and features needed
for CJK support.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: FYI: lamerpad

2003-03-24 Thread Tomohiro KUBOTA
Hi,

From: [EMAIL PROTECTED] (Janusz S. Bie)
Subject: Re: FYI: lamerpad
Date: 12 Mar 2003 08:02:59 +0100

 The crucial question: does lamerpad work for you or anybody else? 
 
 It doesn't work for me, see below.

You are right.  I tested lamerpad and it failed in several
respects.  First, it could not show any Kanji candidates.  Second,
I could not verify that it works as an XIM server to which an
XIM client can connect.  Third, it could not use GNU Unifont properly.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: FYI: lamerpad

2003-03-24 Thread Tomohiro KUBOTA
Hi,

From: Glenn Maynard [EMAIL PROTECTED]
Subject: Re: FYI: lamerpad
Date: Tue, 11 Mar 2003 21:13:16 -0500

  Of course, adoption of Unicode alone cannot make your software
  support CJK languages (more effort is needed).  I hope Lamerpad
  will help developers test their software and will lead to more
  software supporting CJK languages.
 
 What more is needed?
 
 Combining (Korean) and double-width characters (in the case of console apps)
 are two things that need special attention, but they're both just parts of
 supporting Unicode.
 
 Other than that, and input method support (which is unreasonably difficult
 at the moment--based on conversations on this list--except in Windows where
 it's merely annoying), what more is needed in the general case?

If you are talking about full support of Unicode, including the
technical reports and so on, you are right.  However, there is a
lot of software that claims to support Unicode yet cannot handle
bidi, combining characters, double-width characters, UTF-8
sequences longer than two or three bytes, multiple fonts for
multiple scripts, and so on.
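As a concrete illustration of the multi-byte point above, here is a small check (the sample characters are my own, not from the thread) showing that a single character may need up to four UTF-8 bytes, so code that assumes at most two or three bytes per character is already broken:

```python
# One character may need up to four UTF-8 bytes; software that only
# handles the two- or three-byte range breaks on supplementary-plane
# characters such as U+1D11E (MUSICAL SYMBOL G CLEF).
samples = {"A": 1, "\u00e9": 2, "\u3042": 3, "\U0001d11e": 4}
for ch, nbytes in samples.items():
    assert len(ch.encode("utf-8")) == nbytes
```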

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: supporting XIM

2003-03-24 Thread Tomohiro KUBOTA
Hi,

From: Maiorana, Jason [EMAIL PROTECTED]
Subject: RE: supporting XIM
Date: Mon, 24 Mar 2003 11:11:31 -0500

 Which points of XIM do you consider useless?  I don't know
 why you think so --- whether because you really understand XIM
 or because you don't know the complexity and features needed
 for CJK support.
 
 Well, I find most XIM methods to be unstable, and crash alot.
 Plus, they are far too dependant upon locale. I dont see why
 a XIM method should have such fragile dependancies upon the
 locale.
 
 I like to operate under en_US.UTF-8, but I like to enter
 Japanese and vietnamese sometimes. The vietnamese input
 method implemented under GTK+ works fine, no matter which
 locale im logged into. The XIM method for Japanese seems
 only to work under ja_JP.eucjp.

You can send mail to the developers asking for improved
support of UTF-8 locales:

Canna:   http://canna.sourceforge.jp/
FreeWnn: http://www.freewnn.org/
Anthy:   http://anthy.sourceforge.jp/
SKK: http://openlab.ring.gr.jp/skk/
XCIN:http://xcin.linux.org.tw/

However, locale-dependence itself is not a bad thing.  For
example, XCIN supports both traditional and simplified
Chinese depending on locale.  One can imagine an improvement
where the default mode is determined by the locale even if
run-time switching between traditional and simplified Chinese
is supported.


 Also it crashes alot, probably due to Canna being somewhat
 unstable under rh8. (Start Japanese input and type wildly
 for a second, cannaserver will lock up.)

There seem to be poorly implemented XIM clients that cause
XIM servers to lock up.  These are bugs in either the
clients or the servers; please contact their developers.


 I think it should be much more stateless, allowing the client
 library to do the rouma/kana conversions, and simply
 having the server anwer queries for possible Kanji, of course
 all in UTF-8. The state of the clients interface should be
 kept on the client side, imo

I think support of UTF-8 locales would be a good improvement.
Rouma/kana conversion is not as simple as you think, because
the conversion table is configurable in modern conversion engines.
In SKK, rouma/kana conversion and kanji conversion are strongly
connected from the user's point of view, and I don't think such
a separation can be achieved.
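To make the rouma/kana step concrete, here is a minimal greedy longest-match sketch.  The table is a tiny illustrative fragment of my own, not the configurable tables real engines such as Canna or Anthy actually use:

```python
# Tiny illustrative romaji -> hiragana table (real engines use large,
# user-configurable tables).  Greedy longest match resolves ambiguity:
# "nya" must win over "n" + "ya", and a bare "n" becomes ん.
TABLE = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "nya": "にゃ", "nyu": "にゅ", "nyo": "にょ",
    "ya": "や", "yu": "ゆ", "yo": "よ",
    "n": "ん", "nn": "ん",
}

def romaji_to_hiragana(s):
    out, i = [], 0
    while i < len(s):
        # Try the longest chunk first so "nya" is not split as "n"+"ya".
        for length in (3, 2, 1):
            chunk = s[i:i + length]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += length
                break
        else:
            out.append(s[i])  # pass through anything unknown
            i += 1
    return "".join(out)
```

For example, `romaji_to_hiragana("kan")` yields かん while `"kana"` yields かな, which is exactly the backtracking-style ambiguity around "n" discussed elsewhere in this thread.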

Kana-kanji conversion is much more complex.  It is never a
simple one-to-one or one-to-many mapping.  The timing of
rouma-kana and kana-kanji conversion is itself a target of
improvement for input method developers, as SKK shows.  There
are also input methods, such as T-Code, which use neither rouma
nor kana.  It is not a good idea to impose a standardized
communication protocol in the middle of conversion.

There are many input methods embodying various ideas and
user-interface algorithms.  Input method protocols must be as
extensible as possible to allow input method developers to
realize their ideas.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



FYI: lamerpad

2003-03-11 Thread Tomohiro KUBOTA
Hi,

I hope there are people interested in internationalization
and Unicode support including Kanji, but I fear it is difficult
for non-CJK developers to test Kanji font/display/input/print support.

Lamerpad, http://www.debian.org.hk/~ypwong/lamerpad.html, seems to
be a good way for developers who don't know CJK languages to test
whether their own software supports Kanji input.

Of course, adoption of Unicode alone cannot make your software
support CJK languages (more effort is needed).  I hope Lamerpad
will help developers test their software and will lead to more
software supporting CJK languages.

Note that I have not tested Lamerpad yet.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Pango tutorial?

2003-02-25 Thread Tomohiro KUBOTA
Hi,

I am now interested in Pango, because it says it
 - can output anti-aliased text,
 - can handle multilingual text including CJK, bidi, combining
   characters, and Indic complex scripts,
 - can choose the proper fonts for the language (script) of (portions
   of) a given text, which means it doesn't force users to configure
   font settings to display non-Latin-alphabet text,
 - can use multiple fonts for one multilingual text (one font per
   language/script), which means it can display a mixture of Japanese
   and Cyrillic when the system has a Japanese font and a Cyrillic
   font (even without a single font covering both), and
 - is free (meets the Open Source Definition).

However, I have no idea how to use it.  Are there any Pango
tutorials?  Or are there any other text rendering engines which
meet the above conditions?

Concretely, I am now interested in the beta version of xplanet,
which uses FreeType.  However, FreeType is a low-level renderer:
it supports neither bidi nor combining characters, and it does not
take care of the codepoint/language/script coverage of fonts.
Thus, I think FreeType is not suitable for application software
directly; it should rather be regarded as the basis for
higher-level rendering engines.

Thus, the main developer of xplanet and I are searching for a good
text rendering engine and are interested in Pango.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: How to read mail with #nnnn

2002-11-18 Thread Tomohiro KUBOTA
Hi,

P class=3DnormalB=E0i n=E0y #273;FONT face=3DTimes New =
Roman#259;/FONTng kh=E1 l=E2u tr=EAn t#7901;=20
b=E1o b#7841;n. #272;#7895; th=F4ng Minh th#7853;t s#7921; kh=F4ng =
xa l#7841; g=EC v#7899;i ch=FAng t=F4ị Anh t#7915; Nh#7853;t khi=20
#273;FONT face=3DTimes New Roman#7871;n Hoa th#7883;nh =
#272;#7889;n th#432;#7901;ng #273;#7871;n nh=E0/FONT ch=FAng =
t=F4i=20

Most modern HTML rendering engines can decode such numeric
character references.  Thus, you can read it with Mozilla, for example.
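The mechanism behind those `#nnnn` sequences (decimal numeric character references, `&#nnnn;` in full HTML syntax) can be sketched in a few lines; the function name is my own:

```python
import re

def decode_ncr(text):
    # Expand decimal numeric character references such as "&#273;"
    # into the Unicode characters they name (here U+0111).
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)
```

This is what an HTML renderer does for each reference before displaying the Vietnamese text quoted above.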

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux Console in UTF-8 - current state

2002-09-29 Thread Tomohiro KUBOTA

Hi,

At 29 Sep 2002 08:51:55 +0200,
Eric Streit wrote:

  It is probably safe to assume that anybody who wants to avoid framebuffers  
  will not need UTF-8 support, though, so a config option for a stripped  
  down console that way might be useful.
  
 
 
 if we implement a complete graphical environment in the framebuffer...
 it's a way to reinvent X11 ;) 

Well, Linux already has a framebuffer-based console.  Our hope is
just to extend it (or develop something similar) for better Unicode
support.

Though Markus may prefer to stick to the MES-* set, I think supporting
more characters (including CJK and other Asian scripts) is a good choice.

For example, 18x18ja.bdf in XFree86's CVS today is about 4 MB, and
about 600 kB compressed.  Given that this font (or a similar one of
similar size) is mandatory for CJK users, I imagine nobody would
think this size too large to include in the Linux source code.

Other fonts (Arabic, Hebrew, Thai, Khmer, ...) will be considerably
smaller than CJK fonts, and they must be included.

On the other hand, italics and bold can be omitted because they are
not mandatory for any country or language in the world.  Of course
I am not insisting that Linux *should not* support italics and bold;
I am just pointing out that they have lower priority.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux Console in UTF-8 - current state

2002-09-29 Thread Tomohiro KUBOTA

Hi,

At Sun, 29 Sep 2002 03:30:47 -0500 (CDT),
[EMAIL PROTECTED] wrote:

 It _is_ too large to be included. The kernel should include a Latin-1
 font (for backwards-compatiblity) and let the user to load a large
 font if they want.

Though I don't understand where the borderline between too large and
not too large lies, I understand your idea of limiting the built-in
font to the backward-compatibility range, i.e., Latin-1.

Under this idea, the kernel would only have the ability to handle
UTF-8, and fonts would be supplied in separate packages (like the
Linux Console Tools) if users need more than Latin-1 (such as the
Euro sign).  Since most Linux distributions already have such a
package, I think this is reasonable.

I hope the kernel's ability will include support for zero-width and
double-width characters.
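The zero-width/double-width distinction a console needs can be approximated from Unicode character properties.  A rough sketch (the C-library equivalent is wcwidth(3); real terminals handle more edge cases):

```python
import unicodedata

def cell_width(ch):
    # Rough terminal cell width: combining marks occupy 0 cells,
    # East Asian Wide ("W") / Fullwidth ("F") characters occupy 2,
    # everything else 1.
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1
```

With this, "A" is 1 cell, あ is 2 cells, and a combining acute accent (U+0301) is 0 cells, which is exactly the bookkeeping a UTF-8 console must do when placing the cursor.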

Anyway, what I hate is dividing people into two classes: those who
don't need additional files/settings and those who do.
Japanese users have always been forced to read books to configure
software to handle Japanese.  I strongly hope that Unicode will
put the world's peoples on an equal footing.  To achieve this, we
should not spoil Unicode's advantage over ISO-2022 --- the unified
character set --- by splitting the code space and saying this part
is needed and that part is optional.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux and UTF8 filenames

2002-09-20 Thread Tomohiro KUBOTA

Hi,

At Thu, 19 Sep 2002 12:15:09 -0400,
Maiorana, Jason [EMAIL PROTECTED] wrote:

 Do you know of any non-graphical input support for japanese?

As Mike said, there are several such programs.  Indeed, I am using
GNU Emacs to input Japanese over an ssh login.

 I was wondering if such software exists:
 
   - a text-terminal (non X, non-gui) japanese input method system

There are no text terminals that do both display and input.

Kon2 is a Linux Kanji Console, which enables *display* of Japanese
(EUC-JP or Shift_JIS) but has no Japanese input ability.

Jfbterm is a Linux framebuffer multilingual console based on ISO-2022.
However, it too lacks Japanese input ability.

On the other hand, there are several Japanese-input wrappers for the
tty: uum for Wnn, canuum for Canna, and skkfep for SKK.  Since they
are tty programs, they don't do display themselves; they leave that
to the terminal.


   - a batch kanji picker: it's easy to take a quantity of roomaji
     and turn them into kana, but is there a command line tool, or
     anything which could take kana and produce kanjis?

Impossible, because many different kanji can be candidates for the
same kana.  (It is possible with user interaction: display the
candidates and let the user choose one.  The conversion process
needs further work, because it has to divide a sequence of letters
into words, which requires grammatical analysis --- Japanese does
not use whitespace to separate words.)
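The one-to-many problem can be shown with a hypothetical toy dictionary (the readings and spellings below are real Japanese homonym pairs, but the data structure is purely illustrative; real engines such as Canna or Wnn combine large dictionaries with grammatical analysis):

```python
# Hypothetical toy candidate dictionary: one kana reading maps to
# several kanji spellings, so a batch converter cannot pick one
# automatically -- a user (or grammar/context model) must choose.
CANDIDATES = {
    "かがく": ["科学", "化学"],   # kagaku: science / chemistry
    "しりつ": ["私立", "市立"],   # shiritsu: private / municipal
}

def kanji_candidates(kana):
    return CANDIDATES.get(kana, [])
```

An interactive front end would present `kanji_candidates("かがく")` as a menu; a batch tool has no principled way to pick between the two.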

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Linux and UTF8 filenames

2002-09-19 Thread Tomohiro KUBOTA

Hi,

At Thu, 19 Sep 2002 10:17:35 -0400,
Maiorana, Jason [EMAIL PROTECTED] wrote:

 I dont think that IIIMF is really going to address the console issue
 at that level. (Also it uses UTF-16 internally, anyone else find
 that wierd for Unix software?)

Though I don't know how good or bad IIIMF is, I don't know of any
alternative that can input Chinese and Japanese.  I agree that
UTF-16 is a bad choice, but it is not fatal, whereas having no
possibility of supporting Chinese and Japanese (no keymap-like
approach can ever support these languages) is fatal.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Perl 5.8 with significantly improved UTF-8 support is out

2002-07-23 Thread Tomohiro KUBOTA

Hi,

At Tue, 23 Jul 2002 13:54:59 +0100,
markus kuhn wrote:

 Perl 5.8 is out!

Good news.  I will have to try it...

Does it support LC_CTYPE?


 Another major milestone reached ... I guess the emacs-unicode is now the
 only one left ...

The Linux console's Unicode support is very poor.  It can handle
only a few hundred characters, and can handle neither combining nor
double-width characters.  It has no API for CJK input methods.

Another one is Tcl/Tk.  I cannot input Japanese into entry and text
widgets using XIM.  Something must still be wrong, though it may be
a problem specific to the Debian package.

Extended input method support is also needed.  For example, I cannot
input both Japanese and Korean in one xterm session, because no XIM
server supports both Japanese and Korean, while xterm cannot switch
XIM connections.  (mlterm can do this, but I think all software
should be able to.)

Much software should be rewritten using internationalized rendering
libraries such as Pango to support complex languages.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




XTerm patch to call luit (2)

2002-06-12 Thread Tomohiro KUBOTA

Hi,

Here is a new patch.

- The default of the locale resource is changed from true to false.
  (I still have no idea which is best...  See below.)

- Locale-related resource set-up is separated from VTInitialize()
  to a new function VTInitialize_locale().

- Added vi (Vietnamese) for luit-using locales in medium mode.

- Use nl_langinfo(CODESET) if available.
  (Definition of HAVE_LANGINFO_CODESET is not implemented yet.
   Could you help me, Bruno?)

- Use MB_CUR_MAX if available.

- Implemented mystrcasecmp() instead of using locale_str.

I heard from a Japanese person that the locale resource should
default to false for some time, until the resource becomes well
known, to avoid confusion.  Once it is well known, (he said) many
people would think the default should be true, and then it could be
changed to true without annoying anyone.

I think this opinion can be integrated with Juliusz's opinion that
the default should be changed to true once some new font mechanism
becomes dominant.

So, what do you think about defaulting to false?


The following is the new patch.  Please note this patch is for
my previous patch.


[uuencoded attachment xterm-20020604-luit2.diff.gz: binary data omitted]




---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: [I18n][call for comments] XTerm patch to invoke luit

2002-06-07 Thread Tomohiro KUBOTA

Hi,

At Fri, 7 Jun 2002 15:06:09 +0200 (CEST),
Bruno Haible wrote:

 CP1255  (Hebrew)
 CP1258, TCVN  (Vietnamese)
 
 Either you hardwire them, or you document that xterm should not be
 used with 8-bit fonts in these encodings. (Are there 8-bit fonts for
 CP1255, CP1258, TCVN at all??)

For TIS-620 (ISO-8859-11) Thai, I don't like the documentation
approach, because luit already supports TIS-620 and Thai users
clearly benefit from it.

For CP1258 and TCVN Vietnamese, I think luit could easily support
them, though it does not yet.

For Hebrew, I don't think we have to care about it for now, because
XTerm doesn't support bidi and we have still not agreed whether to
support bidi at all.

I could add ISCII to the list of complex 8-bit encodings.  However,
since XTerm doesn't support complex Indic scripts, I think it can
be neglected for now.


IMO, the documentation approach should be avoided as far as possible.
If we need to write documentation for a language, speakers of that
language will probably have to read tens of documents to use tens of
programs.  Japanese users are in exactly this situation, and I imagine
people from other countries such as Thailand and Vietnam are too.


Thus, I think hard-coding th and vi is a good approach for now.

Also, I heard that systems without locale support (built with
X_LOCALE) do not have MB_CUR_MAX.  If that is true, we also need
a fallback for it.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




[call for comments] XTerm patch to invoke luit

2002-06-06 Thread Tomohiro KUBOTA
[truncated uuencoded patch attachment: binary data omitted]

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: [I18n][call for comments] XTerm patch to invoke luit

2002-06-06 Thread Tomohiro KUBOTA

Hi,

At Thu, 6 Jun 2002 18:53:34 +0200 (CEST),
Bruno Haible wrote:

 The default should follow the locale settings. In detail:
 
   - If MB_CUR_MAX == 1:
 
 Look at the specified main font. If it is an 8-bit font,
 use mode 1. Otherwise use mode 3.
 
   - If MB_CUR_MAX  1:
 
 If nl_langinfo(CODESET) is UTF-8, use mode 2.
 Otherwise use mode 3.

I take your opinion to be: use this algorithm for the medium mode,
and make that mode the default.

This algorithm is better because it does not hard-code any locale
names.  However, it does not work well for Thai, for which I'd like
to use the "3. UTF-8 with luit" behavior.

Do you have any idea how to include 8-bit encodings which need
special processing, such as combining?
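Bruno's quoted proposal can be sketched as a small decision function.  This is only a sketch of the quoted algorithm, not xterm's actual code; the mode numbering follows the thread (1 = plain 8-bit, 2 = native UTF-8, 3 = UTF-8 via luit), and the inputs stand in for MB_CUR_MAX, nl_langinfo(CODESET), and the main-font check:

```python
def choose_mode(mb_cur_max, codeset, main_font_is_8bit):
    # Single-byte locale: trust an 8-bit font if one was specified,
    # otherwise fall back to UTF-8 via luit.
    if mb_cur_max == 1:
        return 1 if main_font_is_8bit else 3
    # Multi-byte locale: run natively in UTF-8, or convert via luit.
    return 2 if codeset == "UTF-8" else 3
```

The Thai objection in the reply is precisely that a TIS-620 locale has `mb_cur_max == 1` with an 8-bit font, so this function picks mode 1 where mode 3 would be preferable.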

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ASCII and JIS X 0201 Roman - the backslash problem

2002-05-12 Thread Tomohiro KUBOTA

Hi,

At Fri, 10 May 2002 14:58:21 +0200 (CEST),
Bruno Haible wrote:

 Why is it more harmful if U+00A5 is an escape character than if U+005C
 is an escape character? In both cases you just double it to get the
 original character.

I think you mean that software which treats U+005C as an escape
character should be modified to treat U+00A5 as an escape character
as well.  Am I right?  Then there must already exist data containing
U+00A5 that is not intended as an escape character.


 So it is a minor annoyance over the time of a few months, but by far
 not the costs that you are estimating.

For personal users, I think most people will accept the costs.
However, Unicode is used not only by individuals but also by
companies, and they won't accept such costs.  Think of the Y2K
problem: companies, especially banks and electricity and gas
utilities, had to take extreme care at huge cost.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ASCII and JIS X 0201 Roman - the backslash problem

2002-05-12 Thread Tomohiro KUBOTA

Hi,

At Fri, 10 May 2002 14:17:04 -0400,
Glenn Maynard wrote:

 The problem isn't the conversion costs, it's the fact that Windows will
 continue to use the characters incorrectly, and will reintroduce the
 problem continuously.

Right.  Microsoft will *never* change their modified version of
Unicode.  What we can do is call that encoding non-Unicode, even
though they call it Unicode.


 It wouldn't help people that actually
 need to *use* the Yen symbol, since there'd still be no way to input the
 real single-width yen symbol, though it might be possible to add that to
 the input method.

I think input methods are not the problem now.  This is because
(1) in the Japanese version of Windows, *only* the subset of Unicode
which has a conversion to CP932 is used, since Unicode is limited to
internal processing and the text files users handle are almost
always CP932, and (2) if the encoding or the mapping table is
changed, the input method has to be modified anyway, as a matter of
course.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: ASCII and JIS X 0201 Roman - the backslash problem

2002-05-12 Thread Tomohiro KUBOTA
Hi,

At Fri, 10 May 2002 15:33:13 -0400,
Glenn Maynard wrote:

 Out of curiosity, Tomohiro, is full-width Yen commonly used?  (I'd guess $B1_(B
 would be a more obvious choice for full-width.)

If by "full-width Yen" you mean Unicode U+FFE5, I cannot give an
answer, because Unicode itself is not yet very popular in Japan.
However, the full-width Yen of Shift_JIS and EUC-JP, i.e., 0x216F
in JIS X 0208, is widely used in Japan.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/


Re: Switching to UTF-8

2002-05-06 Thread Tomohiro KUBOTA

Hi,

At Mon, 6 May 2002 07:46:33 +0200,
Pablo Saratxaga wrote:

  In Hiragana/Katakana, processing of n is complex (though
  it may be less complex than Hangul).
 
 No. The N is just a kana like any other, no complexity at all involved.
 Complexity only happens when typing in latin letters. That is why
 the use of transliteration typing will always require an input
 method anyways, it cannot be handled with just Xkb.

In my sentence above, "n" is a Latin letter.  It may correspond to
HIRAGANA/KATAKANA LETTER N *or* be the first keystroke of n-a, n-i,
n-u, n-e, n-o, n-y-a, n-y-u, or n-y-o.  (The keystrokes n-y-a should
give HIRAGANA/KATAKANA LETTER NI followed by HIRAGANA/KATAKANA
LETTER SMALL YA.)

Anyway, I understand your point that Latin - Hiragana/Katakana
conversion cannot be implemented in xkb.
---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Switching to UTF-8

2002-05-05 Thread Tomohiro KUBOTA

Hi,

At 02 May 2002 23:54:37 +1000,
Roger So wrote:

 Note that the source from Li18nux will try to use its own encoding
 conversion mechanisms on Linux, which is broken.  You need to tell it to
 use iconv instead.

I didn't know that, because I am not a user of IIIMF or other
Li18nux products.  How is it broken?


 Maybe I should attempt to package it for Debian again, now that woody is
 almost out of the way.  (I have the full IIIMF stuff working well on my
 development machine.)

I found that Debian has an iiimecf package.  Do you know what it is?


 I don't think xkb is sufficient because (1) there's a large number of
 different Chinese input methods out there, and (2) most of the input
 methods require the user to choose from a list of candidates after
 preedit.
 
 I _do_ think xkb is sufficient for Japanese though, if you limit
 Japanese to only hiragana and katagana. ;)

I assume you are joking about such a limitation.
The Japanese language has far fewer vowels and consonants than
Korean, which results in many more homonyms than Korean.  Thus,
I don't think native Japanese speakers will ever decide to abolish
Kanji.  (Please don't joke on an international mailing list:
people who don't know Japanese may take you seriously.)

Even limited to hiragana/katakana input, xkb may not be
sufficient.  For a one-key-one-kana method, I think xkb can be
used.  However, more than half of Japanese computer users use
romaji-kana conversion, a two-keys-one-kana method.  The complexity
of that algorithm is comparable to a two- or three-key Hangul input
method, I think.  Do you think such an algorithm can be implemented
in xkb?  If yes, then romaji-kana conversion (whose complexity is
like a Hangul input method's) could also be implemented in xkb.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Switching to UTF-8

2002-05-05 Thread Tomohiro KUBOTA

Hi,

At Sun, 5 May 2002 19:12:31 -0400 (EDT),
Jungshik Shin wrote:

  I believe that you are kidding to say about such a limitation.
  Japanese language has much less vowels and consonants than Korean,
  which results in much more homonyms than Korean.  Thus, I think
 
   Well, actually it's due to not so much the difference in
 the number of consonants and vowels as  the fact that Korean has
 both closed and open syllables while Japanese has only open syllables
 that makes Japanese have a lot more homonyms than Korean.

You may be right.  Anyway, the real reason is that the Japanese
language has many words from old Chinese.  Words which are not
homonyms in Chinese become homonyms in Japanese.  (They may or may
not be homonyms in Korean; I believe Korean also has many
Chinese-origin words.)  Since the way new words are coined is based
on the Kanji system, the Japanese language would lose vitality
without Kanji.

   I don't think Japanese will ever do, either.  However, I'm afraid
 having too many homonyms is a little too 'feeble' a 'rationale' for
 not being able to convert to all phonetic scripts like Hiragana and
 Katakana.
 ...

Since I don't represent the Japanese people, I won't say whether it
is a good idea or not to have many homonyms.  You are right: there
are many other reasons for and against using Kanji, and I cannot
explain everything.

Japanese pronunciation does cause trouble, though it is widely
helped by accent and rhythm.  However, in some cases, neither
accent nor context can help.  For example, both "science" and
"chemistry" are "kagaku" in Japanese.  So we sometimes call
chemistry "bakegaku", where "bake" is another reading of the
"ka" of chemistry.  Another famously confusing pair of words
is "private (organization)" and "municipal (organization)",
both of which are pronounced "shiritu".  Thus, "private" is
sometimes called "watakushiritu" and "municipal" is called
"ichiritu"; again, these alias names come from different readings
of the kanji.  If you listen to Japanese news programs every day,
you will encounter these examples some day.

These days more and more Japanese people want to learn more
Kanji to use their abundant power of expression, though
I am not one of these Kanji learners.


   I also like to know whether it's possible with Xkb.  BTW, if
 we use three-set keyboards (where leading consonants and trailing
 consonants are assigned separate keys) and use U+1100 Hangul Conjoining
 Jamos, Korean Hangul input is entirely possible with Xkb alone.

Note for xkb experts who don't know Hiragana/Katakana/Hangul:
input methods for these scripts need backtracking.  For example,
in Hangul, imagine I hit keys in a c-v-c-v (c: consonant,
v: vowel) sequence.  After c-v-c, the input should represent one
Hangul syllable, c-v-c.  However, when I hit the next v, it
should become two Hangul syllables, c-v c-v.

In Hiragana/Katakana, the processing of "n" is complex (though
it may be less complex than Hangul).
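The backtracking described above can be sketched abstractly.  This toy model tracks only the keystroke classes "c" (consonant) and "v" (vowel), not real jamo codepoints; a real Hangul engine does the same re-segmentation on actual characters:

```python
# Group c/v keystrokes into syllables of the form c-v or c-v-c.
# A trailing consonant is provisional: if the next key is a vowel,
# the consonant is re-assigned to start a new syllable (backtracking).
def segment(keys):
    syllables, cur = [], ""
    for k in keys:
        if k == "c":
            if cur == "cvc":              # current syllable is full
                syllables.append(cur)
                cur = "c"
            else:
                cur += "c"
        else:                             # vowel
            if cur == "cvc":              # backtrack: trailing c moves
                syllables.append("cv")
                cur = "cv"
            else:
                cur += "v"
    if cur:
        syllables.append(cur)
    return syllables

print(segment("cvc"))    # ['cvc']
print(segment("cvcv"))   # ['cv', 'cv']
print(segment("cvcvc"))  # ['cv', 'cvc']
```

The point is that a committed syllable must be reopened when a later keystroke arrives, which is beyond a stateless key-to-symbol mapping.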

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-02 Thread Tomohiro KUBOTA

Hi,

At Thu, 2 May 2002 02:14:29 -0400 (EDT),
Jungshik Shin wrote:

   You mean IIIMF, didn't you? If there's any actual implementation,
 I'd love to try it out. We need to have Windows 2k/XP or MacOS 9/X
 style keyboard/IM switching mechanism/UI so that  keyboard/IM modules
 targeted at/customized for each language can coexist and be brought up as
 necessary. It appears that IIIMF seems to be the only way unless somebody
 writes a gigantic one-fits-all XIM server for UTF-8 locale(s).

I heard from the Project HEKE people http://www.kmc.gr.jp/proj/heke/
that IIIMF has some security problems.  I don't know whether
that is true, nor whether the problem (if any) has been solved.

There _is_ already an implementation of IIIMF.  You can download
it from the Li18nux site.  However, I did not succeed in trying it.
Since I have heard several reports from IIIMF users, it is probably
simply my fault.

There seem to be some XIM-based implementations which can input
multiple complex languages.

One is the ximswitch software in the Kondara Linux distribution,
http://www.kondara.org .  I downloaded it but have not tested it yet.

Another is mlterm http://mlterm.sourceforge.net/ , which is an
entirely client-side solution for switching among multiple XIM
servers.  Though I don't think it is a good idea to require clients
to have such mechanisms, it is so far the only practical way to
realize multiple-language input.


   How about just running your favorite XIM under ja_JP.EUC-JP while
 all other applications are launched under ja_JP.UTF-8? As you know well,
 it just works fine although the character repertoire you can enter
 is limited to that of EUC-JP. Of course, this is not full-blown UTF-8
 support, but at least it should give you the same degree of Japanese
 input support under ja_JP.UTF-8 as under ja_JP.EUC-JP. Well, then
 you would say what the point of moving to UTF-8 is. You can at least
 display more characters  under UTF-8 than under EUC-JP, can't you? :-)

So far, there is no conversion engine which requires a character
set beyond EUC-JP.  Thus, EUC-JP is enough for now.  If someone
wants to develop an input engine which supports more characters,
he/she will want to use UTF-8.  However, I think nobody in Japan
feels a strong need for it, beyond purely technical interest in
Unicode itself.
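As a minimal illustration of the repertoire difference, using Python's codecs (the euro sign is one arbitrary example of a character outside EUC-JP):

```python
# U+20AC EURO SIGN: fine in UTF-8, absent from the EUC-JP repertoire.
text = "\u20ac"
print(text.encode("utf-8"))            # b'\xe2\x82\xac'
try:
    text.encode("euc_jp")
except UnicodeEncodeError:
    print("EUC-JP cannot encode U+20AC")
```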


   BTW, Xkb may work for Korean Hangul, too and we don't need
 XIM  if we use 'three-set keyboard' instead of 'two-set keyboard' and can
 live without Hanjas.  I have to know more about Xkb to be certain, though.

I see.  This is not true for Japanese.  Japanese people do need
grammar and context analysis software to get Kanji text.
How about Chinese?


---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-01 Thread Tomohiro KUBOTA

Hi,

At Wed, 01 May 2002 20:02:57 +0100,
Markus Kuhn wrote:

 I have for some time now been using UTF-8 more frequently than
 ISO 8859-1. The three critical milestones that still keep me from
 moving entirely to UTF-8 are

How about bash?  Do you know of any improvement?

Please note that tcsh has already supported east Asian EUC-like
multibyte encodings.  I don't know whether it also supports UTF-8.

How about zsh?


For Japanese, the character width problems and mapping table
problems must be solved before migration to UTF-8 can even _start_.
(This is why several Japanese localization patches are available
for UTF-8-based software such as Mutt.  We should find ways to make
such localization patches unnecessary.)

Also, I want people who develop UTF-8-based software to adopt the
custom of specifying the range of their UTF-8 support.  For example,

 * range of codepoints
U+ - U+2fff?  all BMP? SMP/SIP?

 * special processing
    combining characters?  bidi?  Arabic shaping?  Indic scripts?
    Mongolian (which needs vertical writing)?  How about wcwidth()?

 * input methods
    Any way to input complex languages which cannot be supported
    by the xkb mechanism (i.e., CJK)?  XIM?  IIIMP?  (How about
    Gnome2?)  Or software-specific input methods (as in Emacs or
    Yudit)?

 * font availability
    Though each piece of software is not responsible for this,
    "This software is designed to require the Times font" means
    that it cannot display non-Latin/Greek/Cyrillic characters.
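On the wcwidth() point, here is a rough sketch of how a terminal-cell width can be derived from the EastAsianWidth property.  This simplification counts only Wide/Fullwidth as two cells and everything else as one, ignoring combining characters and the Ambiguous class, which are exactly where the hard decisions lie:

```python
import unicodedata

def display_width(s):
    """Approximate terminal cells, wcwidth()-style."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

print(display_width("abc"))   # 3
print(display_width("日本"))  # 4 (two Wide characters)
```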

Though people in the ISO-8859-1/2/15 regions don't have to care
about these points, other people can easily believe that a piece of
software "supports UTF-8" and then be disappointed when they use it.
Such a person will come to distrust "UTF-8-supporting" software.
We should avoid creating many such people.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-01 Thread Tomohiro KUBOTA

Hi,

At Thu, 2 May 2002 00:16:25 -0400,
Glenn Maynard wrote:

   * input methods
  Any way to input complex languages which cannot be supported
  by xkb mechanism (i.e., CJK) ?  XIM? IIIMP? (How about Gnome2?)
  Or, any software-specific input methods (like Emacs or Yudit)?
 
 How much extra work do X apps currently need to do to support input
 methods?

Much work.  I think this is one problematic point of XIM: it has
meant that only very few programs (those developed by the very few
XIM-knowledgeable developers) can input CJK languages.

The X.org distribution (and the XFree86 distribution) includes a
specification of the XIM protocol.  However, it is difficult (at
least I could not understand it).  So, for practical use by
developers, http://www.ainet.or.jp/~inoue/im/index-e.html
would be useful for developing XIM clients.  I have not read a good
introductory article on developing XIM servers.

I think low-level APIs should integrate XIM (or other input
method protocol) support so that XIM-innocent developers (well,
almost all developers in the world) can use it without annoying
CJK people.  Gnome2 seems to take this approach.  However, I
wonder why Xlib doesn't have wrapper functions which hide the
troubles of XIM programming.


 It's little enough to add it easily to programs, but the fact that it
 exists at all means that I can't enter CJK into most programs.  Since
 the regular 8-bit character message is in the system codepage, it's
 impossible to send CJK through.

Well, I am talking about Unicode-based software.  More and more
developers in the world are starting to understand that 8 bits are
not enough for Unicode, because that is a universal fact.  I am
optimistic in this field; many developers will think 8-bit
characters are a bad idea in the near future.  However, it is
unlikely that many developers will recognize the need for XIM (or
other input method) support in the near future, because it is
needed only for CJK languages.  My concern is how to get these
XIM-innocent people to develop CJK-supporting software.


 How does this compare with the situation in X?

Though I don't know about Windows programming, I often use Windows
for my work.  Imported software usually cannot handle Japanese
because of font problems.  However, the input method (IME) seems to
be invoked even in such imported software.


   * fonts availability
 Though each software is not responsible for this, This software
 is designed to require Times font means that it cannot use
 non-Latin/Greek/Cyrillic characters.
 
 I can't think of ever using an (untranslated, English) X program and having
 it display anything but Latin characters.  When is this actually a problem?

For example, XCreateFontSet("-*-times-*") cannot display Japanese
because no Japanese font matches that name.  (Instead, "mincho"
and "gothic" are the popular Japanese typefaces.)  This type of
implementation is often seen in window managers and their
theme files.
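A minimal illustration of why a hard-coded Times pattern finds no Japanese font.  The XLFD names below are hypothetical examples, and glob matching stands in for the X server's real matcher:

```python
from fnmatch import fnmatch

# A Latin Times font and a Japanese mincho font, as XLFD names.
fonts = [
    "-adobe-times-medium-r-normal--14-140-75-75-p-74-iso8859-1",
    "-misc-mincho-medium-r-normal--14-140-75-75-c-140-jisx0208.1983-0",
]

# A hard-coded "-*-times-*" pattern can never pick up the Japanese
# font, because Japanese typefaces are named "mincho", "gothic", etc.
matches = [f for f in fonts if fnmatch(f, "-*-times-*")]
print(matches)  # only the iso8859-1 Times entry
```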

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Renewed my Unicode/JIS page

2002-04-07 Thread Tomohiro KUBOTA

Hi,

I revised my Unicode/JIS web page.

http://www.debian.or.jp/~kubota/unicode-symbols.html

I used the new EastAsianWidth and mapping tables which are
downloadable from the Internet.  I rewrote my document on the basis
that the Unicode Consortium has never released official mapping
tables between Unicode and east Asian encodings.  I also mentioned
the VARIATION SELECTORS introduced in Unicode 3.2.

Please read and check it.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: 3.2 MAPPINGS/EASTASIA

2002-04-04 Thread Tomohiro KUBOTA

Hi,

At Thu, 4 Apr 2002 11:58:57 +0200 (CEST),
Bruno Haible wrote:

 Thanks a lot for these pointers! With this information, I can write a
 JISX0213 converter for glibc and libiconv.

Please note that these tables may be unofficial.
Though jisx0213code.txt claims to be built from the
official JIS X 0213 standard, it also claims that 1-1-29
was changed from U+2015 to U+2014 because of the JIS X 0221
standard.  The JISX0213 InfoCenter web page claims that
this is a bug in the JIS X 0213 standard.  Also, since the
JIS X 0213 standard was released in 2000, before Unicode 3.2
added the needed compatibility ideographs, the official
mapping table should leave some characters unmapped.

According to the README.txt file in the IBM1394 archive,
it seems to be related to CP932.  Thus, I don't think it
is a good source for an official JIS X 0213 mapping table.

I think you can use either of them (or a combination of
them).  However, there is a risk.  I imagine a new
version of JIS X 0213 will be available in a few years, and
it will have a complete official mapping table.  In that
case, the mapping tables in glibc and libiconv will have to
be changed.

You can wait for the official mapping table or you can
implement a tentative table from jisx0213code.txt and
IBM1394.  Either will be OK.


 I'll make use of these 59 compatibility ideographs in the converter.
 That's the whole reason why they were introduced in Unicode 3.2.

Right.  The problem is, there are no official mapping tables
which use them yet.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: 3.2 MAPPINGS/EASTASIA

2002-04-02 Thread Tomohiro KUBOTA

Hi,

At Tue, 2 Apr 2002 15:36:16 +0200 (CEST),
Bruno Haible wrote:

 Does this also apply to JISX0213:2000? Do you know where to find the
 conversion tables for this character encoding? The PDF file in the
 ISO-IR registry contains only the pictures of each glyph, but no
 conversion table.

I found

http://www.jca.apc.org/~earthian/aozora/0213.html
http://www.jca.apc.org/~earthian/aozora/jisx0213code.zip

but I don't know whether this is an authorized one (or an
informative part of the JIS standard) or merely prepared by one person.


Also, I found 

http://www.cse.cuhk.edu.hk/~irg/
http://www.cse.cuhk.edu.hk/~irg/irg/N807_TablesX0123-UCS.zip

It apparently includes IBM extended characters.


Strictly speaking, a mapping table from JIS X 0213:2000 to
ISO 10646 *cannot* be defined, because JIS X 0213's han unification
rule is different from ISO 10646's.  (You know, Unicode added several
tens of compatibility ideographs which are different characters from
JIS X 0213's point of view but different glyphs of the same character
from Unicode's point of view.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




sorting order of Kanji

2002-02-25 Thread Tomohiro KUBOTA
Hi,

At Mon, 25 Feb 2002 17:24:20 -0500,
Glenn Maynard wrote:

 Kanji appear to be getting collated, however:
 
 05:13pm [EMAIL PROTECTED]/2 [~] sort
  日本
  $Be:No(B
  日本
  (eof)
  日本
  日本
  $Be:No(B
 
 (I couldn't tell if that's the correct collation order, but it's clear
 they're being reordered, where the hiragana above are not.)

It is impossible to collate Kanji using simple functions such
as strcoll(), because in most cases one Kanji has several readings
depending on context (or word).  (This is the Japanese case.)
(It is technically all but impossible; it would need a
natural-language-understanding algorithm.)

For Korean, one Kanji (Hanja) has one reading in most cases,
though there are exceptions.  However, if we ignore such exceptions,
strcoll() could work by using a reading table for all ideograph
characters.  (Though technically possible, it would need a large
dictionary.)

I don't know about Chinese.

Thus, strcoll() simply works as strcmp().
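The difference can be sketched as follows.  The readings table is an illustrative assumption; in real Japanese one kanji has several context-dependent readings, which is precisely why no fixed table is enough:

```python
# Codepoint order (what strcoll() falling back to strcmp() gives)
# versus dictionary order by reading, which needs a readings table.
READINGS = {"東京": "toukyou", "大阪": "oosaka", "日本": "nihon"}

words = ["東京", "大阪", "日本"]
by_codepoint = sorted(words)                  # compares raw codepoints
by_reading = sorted(words, key=READINGS.get)  # needs the dictionary
print(by_codepoint)  # ['大阪', '日本', '東京']
print(by_reading)    # ['日本', '大阪', '東京']
```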

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/


Re: Variation selectors for narrow/wide EastAsian glyphs

2002-02-04 Thread Tomohiro KUBOTA

Hi,

At Mon, 04 Feb 2002 11:40:53 +,
Markus Kuhn wrote:

 One potential alternative is that, given Unicode 3.2 has just
 introduced the notion of variation selectors, we ask the
 UTC and WG2 to consider the addition of two special variation
 selectors for single-width and double-width selection of glyphs
 in the East Asian ambiguous class.

Interesting.  I have a few comments.

1. The range of characters for which I want to use the doublewidth
   version is not limited to the EastAsianAmbiguous class.  The list
   of such characters depends on the Unicode-to-local-encoding
   mapping tables, and we don't have authorized reference mapping
   tables.  Thus, I cannot show an exact list of such characters.
   However, if we want to support the Japanese mapping tables in
   http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA ,
   which are widely used now, the characters under "Width problems"
   in http://www.debian.or.jp/~kubota/unicode-symbols.html
   should be covered by your Width Selector.  (I checked the
   Japanese mapping tables only.  Checking the Chinese and Korean
   tables may add characters to the list.)  Thus, I think it is a
   good idea not to limit the range for which the Width Selector is
   effective.  (Another idea is to change the EastAsianWidth
   definition.  However, my proposal to change EastAsianWidth
   failed...)

2. I am afraid that your proposal (or a proposal to change ISO 6429)
   may take a long time to be realized.  That does not mean it is
   a bad idea to propose the Width Selector.  I mean that we need
   some temporary solution, because this is a practical problem
   rather than a standardization problem.

3. I think this proposal is better than your SCW proposal because
   this proposal is STATELESS, though the SCW proposal could be
   simplified to be stateless.


 That would be most easy to
 implement with existing font display engines that feature ligature
 substitution. That would be a way of allowing applications or
 encoding translation filters to have tight control over the
 width of a character on a character cell terminal, without
 the introduction of new ESC sequences. A font could easily
 contain both narrow (CP437) and wide (JIS) versions of the
 U+25xx box drawing characters, etc.

I don't think the introduction of a new "character" is better than
the introduction of new ESC sequences.  I think they are equivalent.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Li18nux Locale Name Guideline Public Review

2002-01-21 Thread Tomohiro KUBOTA

Hi,

I found the 2nd public review of Li18nux Locale Name Guideline
has started.

http://www.hauN.org/ml/b-l-j/a/800/840.html
http://www.li18nux.org/subgroups/sa/locnameguide/index.html

The page says that comments are welcome until 14 Feb 2002.

Any additions from Li18nux insiders?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: [I18n]Li18nux Locale Name Guideline Public Review

2002-01-21 Thread Tomohiro KUBOTA

Hi,

At Mon, 21 Jan 2002 19:18:09 +0900,
Tomohiro KUBOTA wrote:

 I found the 2nd public review of Li18nux Locale Name Guideline
 has started.
 
 http://www.hauN.org/ml/b-l-j/a/800/840.html
 http://www.li18nux.org/subgroups/sa/locnameguide/index.html

One important note.  I am not a member of Li18nux.  Thus,
people who have opinions should write it to Li18nux.  The
above web page writes how to comment.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-13 Thread Tomohiro KUBOTA

Hi,

At Sun, 13 Jan 2002 03:38:55 -0600 (CST),
[EMAIL PROTECTED] wrote:

 Not allowing any upgrade path from CP932 to Unicode is going to
 encourage them to stick with CP932, and that hurts *everyone*.
 
 There is an upgrade path; intelligently convert the character. I think
 fixing the problem now is better than everyone dealing with it for the
 next 40 years.

If you think so, please persuade Microsoft.

BTW, it is Unicode which introduced the distinction between Shift_JIS
and CP932 and confused us.  Without Unicode, the only difference between
Shift_JIS and CP932 is that CP932 has some additional characters.

Thus, it is wrong to say "This is a problem of CP932, and Unicode
is not responsible."

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-12 Thread Tomohiro KUBOTA

Hi,

At Sat, 12 Jan 2002 03:13:00 -0600 (CST),
[EMAIL PROTECTED] wrote:

 Some people apparently think there's a need (or at least, in the reverse).
 My preference, as a native speaker of neither of these languages, would be
 to display Japanese with a Japanese font and Chinese with a Chinese
 font, and I would be surprised if there were very few people with this
 preference.
 
 I'd prefer my KISS CD's to be displayed in a KISS font, too. That doesn't
 neccessarily mean that it's feasible, or worthwhile to be put in a spec.

How many times have I heard such ignorance of Han characters...  OK,
it is natural that all of us are basically ignorant of non-native
languages unless we study them.

The concept of Han variants is nothing like such a personal
preference.  It is nearly like a difference of characters.  The
terms "font" and "glyph" are merely based on Unicode's view that
Han variants are the same characters, and thus that the distinction
between Han variants is to be achieved _technically_ by changing
fonts and glyphs.

For example, do you think "good" (English word) and "gut" (German
word) are the same word or different words?  Han variants are like
that.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-12 Thread Tomohiro KUBOTA

Hi,

At Fri, 11 Jan 2002 11:42:56 +0100,
Kent Karlsson wrote:

 No it's not.  And I was speaking as a matter of principle.
 If you are talking about the reference glyphs, then it the
 responsibility of whoever is complaining about them to point
 to the *actual* reference glyphs, not some other glyphs,
 that may or may not be the same as the reference glyphs.
 It should not be necessary for the *reader* to try to find
 out if the glyph referred to is sufficiently the same as
 the reference glyph(s) or not for the argument put forward.

You are basically right.  However, the concept of unification
is that the reference glyphs (printed in the standard book)
have no more importance than the other unified glyphs.

I noticed I had one wrong assumption.  I am very sure that the
low-resolution image I suggested is more than enough as a
basis for discussion of Han unification.  However, I did not
notice that I can say that only because I am a native Japanese
speaker and have trained for tens of years in reading Han
ideographs.  Now I realize it is natural that you cannot tell
whether the low-resolution image is enough or not.

In reality, the difference between Han variants is clearly
distinguishable even in the 16x16-pixel fonts which we often use
with the X Window System.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-11 Thread Tomohiro KUBOTA

Hi,

At Fri, 11 Jan 2002 04:51:35 -0800,
Edward Cherlin wrote:

  For example, I can write
  the cost is \100 and the file is C:\text\abc.txt or,
 
 How is such code executed, then? It appears severely broken. No 
 compiler can tell from this code fragment which is supposed to be 
 which, since \100 is a legitimate filespec in Windows.

This is not code.  Assume it is a message meant for human reading.


 Fixing the source code at the source is a lot cleaner than inflicting 
 your fix on the rest of the world. It's as bad as Oracle's attempt 
 to define a standard for its variant UTF-8 (CESU-8, which apparently 
 should be pronounced 'sezyu' in English). Their stated reason is the 
 same, that it's too much work to fix all of their databases, and 
 their cure is to lay even more work off on the rest of the world.

First, this problem affects not only source code but also
many end users' texts.  You can easily imagine that end users' text
files contain many "\" as currency signs AND many "\"
as elements of file names.  Even if you manage to persuade
every Japanese Windows programmer to modify their source code,
you will not be able to persuade Japanese business users to
modify files like accounts.xls .

In the case of Oracle, the problem was limited to the _internal_
encoding of the database (which end users don't care about), and the
end users can be kept free from any trouble, if Oracle does
a good job.  Moreover, conversion from CESU-8 to correct UTF-8
can be done with a simple algorithm.  On the other hand, the
meaning of "\" depends on context and, ultimately, only the
writer of the "\" knows whether it should be U+005C or U+00A5.
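As an aside on the CESU-8 remark above, the "simple algorithm" amounts to recombining UTF-16 surrogate pairs.  A sketch in Python, assuming well-formed CESU-8 input:

```python
def cesu8_to_utf8(data):
    """Convert CESU-8 bytes to proper UTF-8 bytes."""
    # "surrogatepass" lets us decode the 3-byte encodings of the
    # surrogate code points that CESU-8 uses for astral characters.
    s = data.decode("utf-8", "surrogatepass")
    out, i = [], 0
    while i < len(s):
        hi = ord(s[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(s):
            lo = ord(s[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # recombine the surrogate pair into one code point
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(s[i])
        i += 1
    return "".join(out).encode("utf-8")

# U+10400 in CESU-8 is the surrogate pair D801 DC00, each UTF-8-encoded:
print(cesu8_to_utf8(b"\xed\xa0\x81\xed\xb0\x80"))  # b'\xf0\x90\x90\x80'
```

This mechanical fix contrasts with the yen-sign case, where no algorithm can recover the writer's intent.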

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-11 Thread Tomohiro KUBOTA

Hi,

At Fri, 11 Jan 2002 22:19:40 -0500,
Glenn Maynard wrote:

 You have to assume that most Japanese systems will display \ as a Yen
 symbol, because they will.

Japanese Windows systems always display "\" (0x5C in CP932,
which almost everyone calls Shift JIS) and U+005C with a
Yen-sign glyph.  However, most Linux/BSD/UNIX systems display
"\" (0x5C in EUC-JP, the most popular encoding for
Linux/BSD/UNIX systems) and U+005C as a backslash, even in Japan.



 Now, translation tables for CP932 on these systems could translate
 backslash and the yen symbol both to the yen symbol;

What is "both"?  I think you are talking about both the backslash
and the yen symbol.  But what do you think their codepoints are in
CP932?  Answer: CP932 has the following yen signs and backslashes:


  CP932 (Shift JIS)                Unicode (mapped by the CP932 table)
  -----------------                -----------------------------------
  0x5C (yen sign)                  U+005C (yen-sign glyph on Windows)
  0x81 0x5F (fullwidth backslash)  U+FF3C (fullwidth backslash)
  0x81 0x8F (fullwidth yen sign)   U+FFE5 (fullwidth yen sign)


Note that CP932 0x5C (yen sign) is derived from JIS X 0201, while
CP932 0x81 0x5F and CP932 0x81 0x8F are derived from JIS X 0208.

Thus, if you modify the CP932 table so that 0x5C maps to U+00A5,
it does not break round-trip compatibility with CP932.

In the case of Ogg, I think this can be a solution, because the
strings are never parsed as filenames.  However, this cannot
be a general solution.
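Python's cp932 codec can illustrate the table above (0x5C decodes to U+005C; only the rendering on Japanese Windows shows a yen glyph):

```python
# The three CP932 code points discussed above, via Python's cp932 codec.
print(hex(ord(b"\x5c".decode("cp932"))))      # 0x5c   (U+005C)
print(hex(ord(b"\x81\x5f".decode("cp932"))))  # 0xff3c (fullwidth backslash)
print(hex(ord(b"\x81\x8f".decode("cp932"))))  # 0xffe5 (fullwidth yen sign)

# The fullwidth characters are separate double-byte codes, so remapping
# single-byte 0x5C to U+00A5 would not collide with them, and
# round-trip conversion would still work.
assert "\uff3c".encode("cp932") == b"\x81\x5f"
assert "\uffe5".encode("cp932") == b"\x81\x8f"
```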

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-10 Thread Tomohiro KUBOTA
Hi,

For glyph references, I am using
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=f9b1
and so on.  Otherwise, the displayed glyphs depend on the system
and we cannot discuss the same glyph.

At 9 Jan 2002 23:52:49 -0800,
H. Peter Anvin [EMAIL PROTECTED] wrote:

 My wife's name is Suzi (Susan).  Since it happens to phoneticize
 pretty poorly into Japanese, she has chosen to use the same Suzuran
 ("lily of the valley") in Japanese rather than spelling her name in
 Katakana.  "Suzuran" is U+9234 U+862D (鈴蘭); however, I could
 personally not have told that the reference glyph for U+9234 was the
 same character.  I actually found a "compatibility form", U+F9B1,
 which looks a lot more like I thought the character should look,
 but that one is apparently only supposed to be used for Korean.

I feel U+F9B1 is a glyph for printing.  Japanese people use the
U+9234 reference glyph for handwriting, and we can read it.
However, we never use it for printing, and U+9234's reference
glyph in print looks somewhat funny to me.

Please refer to U+F9A8 vs U+4EE4 for a clearer example.

There are a few such exceptional cases.  For example, U+8A00:
the top element is written as a "dot" in the reference image.
However, we use a "vertical stroke" in handwriting and a
"horizontal stroke" in printing.  We never use the "dot".  (I
could not find images for them.)

The image for U+5165 is also handwriting-like.  I could not find
an image of the printing glyph.

Thus, I cannot say which is "Japanese", U+9234 or U+F9B1.
Average Japanese people (who don't know Chinese or Korean)
don't think the difference between U+9234 and U+F9B1 has anything
to do with Chinese, Japanese, or Korean.  The fonts on my system
are like U+F9B1.

I think there are a few more examples.  It is difficult to
show "all" examples, just as it is difficult for a native English
speaker to list "all" the verbs (s)he knows, or even for me to
list "all" irregular English verbs (like go-went-gone and
come-came-come).  However, I feel the number of examples would
be very small.


Note that "Kyokasho-tai" (textbook typeface) is designed to be
similar to handwriting, but this typeface is rarely used outside
Japanese textbooks for elementary school.


 Interestingly, at least on my system U+9234 is displayed in the
 Japanese glyph rather than the reference glyph.

My system also shows both U+9234 and U+F9B1 like the U+F9B1 image.

---
久保田智広 Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
"Introduction to I18N"  http://www.debian.org/doc/manuals/intro-i18n/


Re: Unicode, character ambiguities

2002-01-10 Thread Tomohiro KUBOTA

At Thu, 10 Jan 2002 01:06:22 -0500,
Glenn Maynard wrote:

 How major a problem is this in practice, right now?
 
 One temporary solution I could suggest is having specs (in this case,
 Ogg tags) choose a specific vendor's translation tables for these, and
 saying until Unicode standardizes these tables, use these, not your
 system's.  That would at least (try to) guarantee that until that
 happens, if a user enters text on one system in SJIS, and moves it to
 another via UTF-8, he'll get the same SJIS output.

I think it is a good idea.  I'd like you to ask the Unicode
Consortium to follow your idea.  However, the problem is that the
Unicode Consortium doesn't have enough political power to define
one standardized table, and it doesn't have the will to release one
authorized mapping table.

Do you think vendors like MS, Sun, IBM, Apple, and so on (all of
them members of the Unicode Consortium) will throw away their
private mapping tables and follow a common one, though it means
these vendors would lose compatibility with their previous
products?  It is almost impossible.

However, I think such vendors' interests run against users'
interests.  Thus I want many people to send mails requesting one
standard mapping table.

There is a possibility that some private table will become popular
enough to be a de-facto standard.  I imagine many vendors are
hoping that their own private table will win the status of de-facto
standard.  Though I don't like the MS private table (CP932) because
it has many more differences from the other tables, I will welcome
it if it can end this confusing situation.  See the chapter
"Conversion tables differ between vendors" in
http://www.debian.or.jp/~kubota/unicode-symbols.html
for details.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Unicode, character ambiguities

2002-01-10 Thread Tomohiro KUBOTA

Hi,

At 10 Jan 2002 10:02:21 -0800,
H. Peter Anvin [EMAIL PROTECTED] wrote:

  I think there are a few more examples.  It is difficult to
  show all examples, like it is difficult for a native English
  speaker to show all verbs (s)he knows.  It is also difficult
  even for me to list all irregular English verbs (like
  go-went-gone and come-came-come).  However, I feel the number
  of examples would be very small.
  
 
 If so, that would imply the number of code points would also be very
 small, and that it wouldn't be a major loss to assign code points to
 them.  Would you agree?

No; I should add that one example may stand for hundreds of characters,
because one radical may be shared by hundreds of characters.





Re: Unicode, character ambiguities

2002-01-09 Thread Tomohiro KUBOTA

Hi,

At Wed, 9 Jan 2002 11:59:12 -0500 (EST),
Henry Spencer wrote:

 Indeed so.  But you are also an insider with strong opinions on the
 matter, and that will influence your reporting, no matter how hard you try
 to be impartial.  (Even experimenters systematically recording data tend
 to make errors favoring their own beliefs, perhaps because they are more
 careful when recording favorable results.  This is why medical
 experiments nowadays always use double blind procedures, in which the
 experimenter himself does not know which patients are getting which
 treatment until afterward.)

So, do you mean that I am not free from such a bias while you are?
Did the Japanese scholar who prepared the Han Unification say that
Japanese people can read Chinese or Korean glyphs?  Did (s)he say
that his/her theory is widely accepted by ordinary Japanese people?

Yes, I admit my opinion is not that of the average Japanese person;
I am more of a Unicode lover than average Japanese people are.





Re: Unicode, character ambiguities

2002-01-09 Thread Tomohiro KUBOTA

Hi,

At Wed, 9 Jan 2002 17:26:47 -0500 (EST),
Henry Spencer wrote:

 I have no bias on the subject mostly because I have no opinion on the 
 subject. :-)  I don't claim to know what the general opinion in Japan
 about Unicode or Han unification is (or would be).

Sophism.  For example, you may be interested in Unicode and may hope
that Unicode becomes popular as soon as possible, without caring about
native Japanese speakers' interests.

What do you suspect about my opinion?  I said that I hope Unicode will
be usable for native Japanese people.  I sometimes criticise Unicode
because I want it to become more useful, not because I hate it.
What's wrong with this position?  What bias do you see?

As for myself, I graduated from a university, which may mean my
knowledge of Kanji characters is above that of the average Japanese
person.  Thus I may be biased in that I know somewhat more Kanji
characters than average.  However, my job is not related to computers,
publication, typesetting, or literature, so my knowledge of Kanji may
be lower than that of people with such jobs.

I have disclosed everything that may bias my opinions or feelings.  And you?





Re: Unicode, character ambiguities

2002-01-09 Thread Tomohiro KUBOTA

Hi,

At Wed, 9 Jan 2002 16:00:27 +0100,
Pablo Saratxaga wrote:

  Not true.  I am a native Japanese speaker.  There are some characters
  whose Japanese version is very basic (and elementary school student
  can read) while I cannot read Chinese version.
 
 But are those unified?
 Have you an example of a unified one in such case?

Yes, unified.  The most famous example is U+76F4.  I'd like to show
an image, but images are not available at:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76f4

There are many characters which have the walk radical.  The Japanese
walk radical (usually) has one dot (note that dot is a term of
Japanese calligraphy for an element of Kanji characters, like
vertical stroke, horizontal stroke, right brush, left brush, and
leap, though my English translation may be wrong), while the
traditional-Chinese and Korean walk radical has two dots.  Since only
a small number of Japanese characters use the walk radical with two
dots (i.e., in Japan, some characters have two dots and many others
have one dot), the number of dots is important for Japanese.
However, they are unified.  (In this case, Japanese people
kan reed iT, just giv!ng a phunNy empResion.)

(U+2ECC is the walk radical with one dot, U+2ECD the walk radical
with two dots.  There is another variant, U+2ECE, which is not used
in Japan.  Though these radicals are not unified, characters
containing these walk-radical variants are unified.)

There are many radicals which have similar problems.


  Many older Japanese people can read traditional-Chinese style, because
  Japanese people used to use the style until about 1950.
 
 But those are not unified.
 Those that have two different codepoints in japanese encodings are two
 different ones in unicode too.

Since the use of t-C style characters is exceptional in modern Japanese
(limited to family names, place names, and a few other cases), not
many t-C style characters are encoded in Japanese character sets.


 Maybe I'm missing something and there are indeed some characters that
 are problematic; however I haven't encountered none. On the other side
 I agree that my knowledge of kanji must be far below yours, so maybe I
 just happen to not know the ones that are problematic (among others I don't
 know either, of course).
 
  Believe me, I read tens of ads every day (on TV and on newspapers)
  because I live in Japan.  (Sometimes Japanese ads may use very difficult
  character which nobody can read.  The purpose is just to give an
  authorized or intelligent impression.)
 
 I once saw a picture of an ad that have the kanji for buy (I think, don't
 recall exactly) with its shell radical replaced by a real picture of
 a real shell; if it wasn't told in the text below the image what it was
 supposed to be I wouldn't had discovered it, for sure.

A real shell image for the shell radical?  That is just art, not a
character, at least for any level of computer text processing.
Of course it is not traditional-Chinese style.






Unicode 3.2 Beta

2002-01-08 Thread Tomohiro KUBOTA

Hi,

Unicode 3.2 Beta is now under public comment period.
http://www.unicode.org/versions/beta.html

It has Variation Selectors from U+FE00 to U+FE0F.  However, the list
of variations, i.e., StandardizedVariants.html, is not available yet.

Does anyone know the details of this?  I'd like to know whether
Variation Selectors can be used for CJK Han variants.
(I sent a mail to [EMAIL PROTECTED] a few days ago but
have not received a reply yet.)





Re: Unicode, character ambiguities

2002-01-08 Thread Tomohiro KUBOTA

Hi,

I am a native Japanese speaker, and I think I am somewhat more of a
Unicode lover than the Japanese average.

At Tue, 8 Jan 2002 23:03:35 -0500,
Glenn Maynard wrote:

 What, exactly, needs to be done by an application (or rather, its data
 formats) to accomodate CJK in Unicode (and other languages with similar
 ambiguities)?

The most well-known criticism of Unicode is that it unified Han
ideograms (Kanji) from Chinese, Japanese, and Korean that have
similar shapes and origins, even though they are different
characters.  Even native CJK speakers and CJK scholars can disagree
on whether this Kanji and that Kanji are different characters or the
same character with different shapes.  Since Unicode takes a position
different from that of most ordinary Japanese people, Japanese people
have come to generally dislike Unicode.  It is natural that scholars
hold a wider variety of opinions than ordinary people, and the
Unicode Consortium did find a native Japanese scholar who supports
Unicode's position.  But that position differs from ordinary Japanese
people's.  Thus, Japanese people think Unicode cannot distinguish
different characters from China, Japan, and Korea.  Unicode's view is
that these are the same characters with different shapes (glyphs), so
they should share one codepoint, because Unicode is a _character_
code, not a _glyph_ code.  This is Han Unification.  Now nobody
can stand against the political and commercial power of Unicode,
and Japanese people feel helpless.

Note that I have heard that Chinese and Korean people hold a
different opinion on Kanji from the Japanese: they think Kanji from
China, Japan, and Korea are the same characters with different
shapes, and they accept Unicode.

If your software supports only one language at a time, you can use
Unicode, and the only problem is choosing the proper font.  Here, a
Japanese font means a font which has Japanese glyphs (in Unicode's
view) at the Han Unification codepoints.  The task is then to use a
Japanese font for Japanese, a Chinese font for Chinese, and a Korean
font for Korean.

However, if your software supports multilingual text, the problem
can be difficult.  Japanese people want to distinguish unified
Kanji.  However, many (even Japanese) people are satisfied if
Japanese text is rendered in a Japanese font.  Thus, an easy
compromise is to use a Japanese font for all Han Unification
characters.  (Chinese and Korean people will accept it.)


I think the Han Unification problem can be ignored for daily
usage by adopting the compromise I wrote above.


 Is knowing the language enough?  (For example, is it enough in HTML to
 write UTF-8 and use the LANG tag?)
 
 Is it generally important or useful to be able to change language mid-
 sentence?  (It's much simpler to store a single language for a whole data
 element, and it's much easier to render.)

Of course, if your software can carry language information, that is
great; mid-sentence language support is excellent!  Using a Japanese
font everywhere (as I wrote above) is a _compromise_, so anything
that avoids the compromise is always welcome.

However, I would rather see more and more of the world's software
become able to handle CJK characters as soon as possible than wait
for perfect CJK support.

There are a few ways to store language information: language tags
above U+E0000, mark-up languages like XML, and so on.  I wonder
whether Variation Selectors in the Unicode 3.2 Beta
http://www.unicode.org/versions/beta.html
can be used for this purpose.  Does anyone have information?



As for round-trip compatibility: yes, round-trip compatibility for
EUC-JP, EUC-KR, Big5, GB2312, and GBK is guaranteed, i.e., Unicode
is a superset of these encodings (character sets).  However,
(1) there are no authoritative mapping tables between these encodings
and Unicode, and various private mapping tables exist.  This can
cause portability problems around round-trip compatibility.
(2) Unicode is _not_ a superset of the combination of these encodings,
i.e., Unicode is _not_ a superset of ISO-2022-JP-2 and so on.
For (1), I am now trying to get the Unicode Consortium to adopt some
solution or to write a note or technical report about this problem.
I hear that the Unicode Technical Committee is now discussing it.
For (2), no solution can exist, because Unicode and ISO-2022 have
different opinions on what constitutes the identity of a character.
However, the use of language tags or variation selectors(?) can
partly solve this problem.  Still, an authoritative way to express
the distinction between CJK Kanji must be determined, and everyone
must follow it, to keep portability.  I hear nobody is wrestling with
this problem now... "authoritative" is a political problem rather
than a technical one.


Note that the internal encoding may be Unicode, but the stream I/O
encoding has to be specified by the LC_CTYPE locale.  This is
mandatory for internationalized software.


Re: A nl_langinfo(CODESET) emulator for FreeBSD and other legacy platforms

2001-12-26 Thread Tomohiro KUBOTA

Hi,

At Wed, 26 Dec 2001 19:29:48 +,
Markus Kuhn wrote:

 Simply ship your software with a little nl_langinfo() emulation that
 fixes that problem until the FreeBSD people get they act together and
 finally implement it. It can't take that much longer any more.

Good work.  Bruno's libcharset is also available for this purpose.
Writing the function as an emulation is a good idea.


   http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.c
   http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.h

The Debian GNU/Linux locales package includes various locale/encoding
pairs; you may want to include them.  In particular, TIS-620 for th
would be needed.  (If you want, I can send you the file.)

Also, a suggestion: if LANG (or LC_CTYPE or LC_ALL) has a .encoding
part, it should be checked first.  (Currently langinfo.c checks for
utf and 8859- only; Chinese users may use GBK or GB18030, and Hong
Kong people may use Big5HKSCS.)

Some people may use alias names of locales, such as german for de_DE
and french for fr_FR.  Is there any way to handle these cases?

I have heard that the default encoding for the Japanese locale on
some proprietary Unix systems is Shift_JIS, not EUC-JP.  However, I
don't know the details, so I cannot suggest a concrete sample
implementation.





Re: Emacs and UTf8 locale

2001-12-25 Thread Tomohiro KUBOTA

Hi,

At Mon, 17 Dec 2001 21:02:03 +0100 (MET),
Oliver Doepner wrote:

  Also, what exactly does Emacs do to use it?
 
 It sets the language environment to utf-8, and sets the default and 
 preferred coding systems to utf-8.  It also sets the default input 
 method.

Sorry for replying to an old discussion.  I think UTF-8 mode should
not imply the default input method.

UTF-8 mode should only mean that the default input encoding is UTF-8
(since Emacs has an encoding-guessing and fallback mechanism, it can
fall back to other encodings if the input file cannot be UTF-8; the
fallback encodings can be locale-dependent) and that the default
output encoding is UTF-8.  The input method depends on the language,
not the encoding.





Re: Two questions about console utf8 support

2001-12-25 Thread Tomohiro KUBOTA

Hi,

At Sat, 22 Dec 2001 21:50:31 -0800 (PST),
James Simmons wrote:

   http://linuxconsole.sourceforge.net
 
 Hi folks. I'm that person that is rewriting the console system.

Interesting.  Though there have been a few projects such as KON for
Japanese, HAN for Korean, and JFBTERM for ISO-2022-based i18n,
none of them has planned to be integrated into the Linux source tree.
(Off-topic: part of the reason is that skilled Japanese developers
are sometimes not good at the English language.)

I have one request, though I am not very familiar with this area.
As you know, East Asian languages use thousands of characters, and
we need a conversion engine to input them.  For the X Window System,
we have a standard protocol called XIM; however, there is no such
standard for the console.  Is your project planning to supply an API
or interface for this purpose?  East Asian people would be much
happier if the API were standardized so that the same conversion
engine could be used on Linux, BSD, and other UNIX-like systems.





Re: Emacs and UTF-8 locale

2001-12-18 Thread Tomohiro KUBOTA

Hi,

At Tue, 18 Dec 2001 15:38:19 +0200 (IST),
Eli Zaretskii wrote:

utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
 
 Thanks.  This is something that should be added to Emacs.  For now, Emacs 
 implements the backup procedure, which is the Lisp equivalent of the 
 following:
 
  char *s;
  int utf8_mode = 0;

  if ((s = getenv("LC_ALL")) ||
      (s = getenv("LC_CTYPE")) ||
      (s = getenv("LANG"))) {
      if (strstr(s, "UTF-8"))
          utf8_mode = 1;
  }
  
  It is important that you do not only test LANG, but the first variable
  in the sequence LC_ALL, LC_CTYPE and LANG that has a value.
 
 That is what Emacs does.

Why limit this to UTF-8?  Since the LC_CTYPE locale is widely used
not only for UTF-8 but also for various other encodings, and since
GNU Emacs supports such encodings, I think it is a good idea to use
the LC_CTYPE locale not only for detecting UTF-8 mode but also for
detecting other encodings such as ISO-8859-*, KOI8-*, EUC-*,
TIS-620, Big5, and so on.





Re: Yudit and XIM

2001-12-16 Thread Tomohiro KUBOTA

Hi,

At Fri, 14 Dec 2001 18:45:13 +0100,
Juliusz Chroboczek wrote:

 XIM is intrinsically a locale-dependent protocol, as the set of
 available input methods is locale-dependent.  Thus, the IM must be
 opened in the Input Method's locale.

Right.

 On the other hand, once the IM has been opened, its usage is fully
 locale-independent; conversion from the IM's codeset to UTF-8 is done
 internally by Xlib.

If you use Xutf8LookupString(), usage is locale-independent.
(Instead, it depends on one specific encoding, UTF-8.)  If you use
XmbLookupString(), it will still return strings in the encoding of
the locale in effect when you opened the IM.

 In practice, what this means is that the user must set her locale
 according to the IM's she wishes to use.  The programmer, on the other
 hand, does not need to bother with locale issues.

For standard software, this is right.  (Standard means the software
uses the standard mechanism for users to choose input methods.)
Yudit, on the other hand, doesn't follow the standard way (users must
configure Yudit), and it chooses the IM from its menu.  That is
Yudit's design.  To stay consistent with that design, Yudit should be
able to choose input methods from the menu; concretely, it should
offer skkinput, kinput2, xcin (traditional Chinese), xcin (simplified
Chinese), ami, htt, and so on.  Of course, I think the _initial_
input method can be chosen by following the standard configuration
used by all XIM-supporting software.





Re: Yudit and XIM

2001-12-13 Thread Tomohiro KUBOTA

Hi,

At Thu, 13 Dec 2001 19:39:15 +0100,
Juliusz Chroboczek wrote:

 Bruno has added full support for locale-independent use of XIM in
 XFree86 4.1.0 Xlib.  The 4.1.0 version has some bugs, for reliable
 support you will want to use the Debian patched version or 4.1.99.2 or
 later (current CVS should be fine).
 
 For more information, man Xutf8LookupString(3)
 
   http://www.xfree86.org/current/Xutf8LookupString.3.html
 
 or see the function Input() in input.c in a reasonably recent version
 of XTerm.

I think the introduction of Xutf8LookupString() is not sufficient
to make XIM locale-independent.  For the OverTheSpot preedit type,
the XIM client has to prepare an XFontSet so that the XIM server
uses it for displaying preedit strings.  This font (fontset) _must_
be prepared by the client side to keep consistent proportions between
already-entered strings and preedit strings.  (The aim of the
OverTheSpot preedit type is to make users feel as if the preedit
string is displayed seamlessly.)

I suggest one solution.  I am very sure that people who want to use
an XIM know about locales and have the proper locale for using it.
(I don't understand why Gaspar doesn't want to introduce
locale-dependent features; introducing them does not reduce usability
for people on OSes which don't support locales.)

Thus, as you prepared kinput2 as a menu item for input, how about
preparing menu items for other popular XIM servers?  The database of
XIM servers (inside Yudit) would also hold the proper locale for each
XIM server, and setlocale(LC_CTYPE, proper_locale) would be called
when a user chooses an XIM for input.  The list should be
user-customizable because we can never know the complete list of all
XIM servers in the world.  Please test mlterm
(http://mlterm.sourceforge.net), which can dynamically change XIM
servers using this method.





Re: StarWars

2001-12-13 Thread Tomohiro KUBOTA

Hi,

At Thu, 13 Dec 2001 15:24:09 +,
Markus Kuhn wrote:

 If you like VT100 terminals, I'm sure you will enjoy this
 
   telnet towel.blinkenlights.nl

My connection to this site was refused...

 Anyone up to make a UTF-8 version of this? :)
 
 http://www.asciimation.co.nz/

Interesting.  However, this is already UTF-8, though only a small
subset of U+0020 - U+007E seems to be used. :-)





Re: diacritics in xterm

2001-12-12 Thread Tomohiro KUBOTA

Hi,

At Tue, 11 Dec 2001 21:47:49 +0100,
Radovan Garabik wrote:

  : Thank you for the hint. So does this mean, the problem hasn't been
  : fixed for two years and you recommend the dangerous fix by replacing the
  : xterm binary?
 
 It seems so.
 I am running the dangerous binary for about 4 months, in both UTF-8
 and ISO-8859-2 locales and so far have no problems at all.

Please make sure that the fix does not disable multibyte character input.





Re: input method for Japanese

2001-12-09 Thread Tomohiro KUBOTA

Hi,

At Sun, 9 Dec 2001 12:27:44 +0100 (CET),
Gernot Jander wrote:

 I have some applications for reading, editing and learning Japanese
 which are until now based on the kinput2/canna input method. As far as
 i can see, this method is bound to the EUC encoding.
 Is there any other input method known, that uses utf-8 and works with
 the ja_JP.utf-8 locale? Or is any work in progress for Japanese input
 with utf-8 which i can join?

kinput2 supports both the kinput2 and XIM protocols (and a few other
protocols as well).  Note that the kinput(2) protocol was developed
before the standardization of X Window System internationalization
and is now obsolete.

When using the XIM protocol (the X11R6 standard), you can input
Japanese characters with kinput2 into software running under the
ja_JP.UTF-8 locale.  For example, you can input Japanese into xterm
under a UTF-8 locale using kinput2.  Thus, we don't need to develop
UTF-8-based Japanese input methods.


OffTopic:

I also want Yudit to adopt the XIM protocol instead of the kinput2
protocol.  There are a few Japanese input method programs, such as
kinput2, skkinput, xwnmo, and so on.  (Note that I am not talking
about the backend conversion engine.)  All of them support the XIM
protocol, while only kinput2 supports the kinput2 protocol.
Moreover, there are Korean and Chinese XIM servers such as Ami and XCIN.





mlterm mailing list is now opened

2001-11-29 Thread Tomohiro KUBOTA

Hi, everyone.

mlterm (MultiLingual TERMinal emulator), which I introduced a few days
ago in i18n@xfree86 and debian-i18n lists, has got a SourceForge hosting.

http://www.sourceforge.net/projects/mlterm/
http://mlterm.sourceforge.net/

mlterm is a terminal emulator with the following unique features:
 - various encodings are supported (multilingual)
 - combining characters (TIS-620, TCVN5712, JIS X 0213, and UTF-8)
 - anti-aliased fonts with Xft and TrueType fonts
 - multiple windows in one process
 - the XIM can be changed dynamically at run-time, and you can input
   multiple complex languages such as Japanese and Chinese
 - scrolling with the mouse wheel
 - background image (in other words, wallpaper)
 - transparent background
 - scrollbar plugin API (unstable)

Two mailing lists are now available: one for discussion in English,
the other for discussion in Japanese.  I imagine some of you will be
interested in joining the English mailing list.





Re: /efont/ and xterm (Re: UTF-8 Terminals)

2001-11-16 Thread Tomohiro KUBOTA
Hi,

At Wed, 14 Nov 2001 02:12:19 +0100 (CET),
Markus Kuhn wrote:

 xterm is not suited for proportional or bi-width fonts. Split the font
 up into a 8x16/16x16 pair, and there will be no problems. Just like
 you have to do with Unifont.

I'd like to know XTerm's policy.
What is the reason for the lack of support for bi-width fonts like
GNU Unifont and /efont/ ?
Is it XTerm policy?  Will they be supported in the future?
Or are you willing to accept patches to support them?

I have no strong opinion on how biwidth (or doublewidth) fonts
should be assembled.  XFree86's doublewidth fonts don't contain
singlewidth glyphs and are exactly fixed width, while GNU Unifont
and /efont/ contain both singlewidth and doublewidth glyphs.  I
don't know which is better; I don't even know whether they should
follow one unified policy.

However, it would benefit users if XTerm supported GNU Unifont and
/efont/ as-is.  If a patch of tens of lines for XTerm can save the
time of millions of users, it is absolutely worth doing.

If nobody is working on XTerm support for GNU Unifont and /efont/,
I'd like to investigate.  Can anyone tell me where I should start
reading the XTerm code?



Re: UTF-8 Terminals

2001-11-12 Thread Tomohiro KUBOTA

Hi,

At Sat, 10 Nov 2001 16:19:21 +,
Markus Kuhn wrote:

 Hardly anyone needs full Unicode. If all you are interested in are
 European scripts and symbols for instance, then the 3 kilocharacters of
 the Unicode subset MES-3 are more than good enough for your needs, and
 the XFree86 standard xterm fonts 6x13, 8x13, 9x15, 9x18, 10x20 have
 covered MES-3 for over a year now and are widely used.

It is true that hardly anyone needs full Unicode.  However, which
subset of Unicode is needed differs from person to person.  For
example, as you said, MES-3 would be a good subset for European
people; people from other countries need other subsets.

Since XFree86 is a single distribution for the whole world, it
should satisfy the needs of people all over the world.


 People who can read CJK glyphs have used larger font sizes so far and
 will continue to do so in the future. 

True.  Japanese people like 7x14 + 14x14 fonts, and Korean and
Chinese people like 8x16 + 16x16 fonts.

XTerm has used the 6x13 font as its default (because the fixed font
was 6x13).  Thus, it is reasonable to have a 12x13 font so that
XTerm with the default settings can display as many characters as
possible (including CJK scripts).  I think that is not too small for
CJK glyphs, because there are small (though not so beautiful) fonts
for Japanese, for example 10x10 and 12x12.

BTW, do you know the /efont/ project
  http://openlab.ring.gr.jp/efont/index.html
  http://openlab.ring.gr.jp/efont/unicode/index.html
which has 10-, 12-, 14-, 16-, and 24-pixel Unicode fonts?
The web page has a table of the subsets these fonts cover.
Though I am not taking part in the project, I hope these fonts
will be used as widely as the ETL intlfonts.





Re: [I18n]Call for testers: luit in XFree86 CVS

2001-11-12 Thread Tomohiro KUBOTA
Hi,

At Tue, 13 Nov 2001 13:28:42 +1100 (EST),
Jim Breen wrote:

 I think we can get into serious hair-splitting here. My copy of JIS X 0213
 describes itself as "$B3HD%4A;z=89g(B" (enlargement or extension kanji set), 
 and the text inside makes it pretty clear that it it is in addition to 
 JIS X 0208. I noted the new "JIS Kanji Dictionary"  of which I saw some
 proofs in Tokyo earlier this year is described as covering JIS X 0208 
 and JIS X 0213. (Poor old JIS X 0212 is forgotten.)

It is clear that JIS X 0213 includes JIS X 0208 (except for "dis-unified"
characters).

http://www.asahi-net.or.jp/~wq6k-yn/code/enc-x0213.html
http://www.watch.impress.co.jp/internet/www/column/ogata/index.htm
http://www.jca.apc.org/~earthian/aozora/0213.html
http://www.itscj.ipsj.or.jp/ISO-IR/index.html


 I think there were a total of 56 kanji "dis-unified" in this way.

Sorry, "kuchi-taka" and "hashigo-taka" are not "dis-unified".


 Certainly if you set out to use JIS X 0213 you really have to  run with a 
 a single set combining the characters defined in both JIS X 0208 and 
 JIS X 0213, which is what the existing  font files do. 

No.  Though JIS X 0213 is an extension of JIS X 0208, JIS X 0213
itself includes all JIS X 0208 characters.  Thus, JIS X 0213 is
intended to be a replacement for JIS X 0208.

Please check the references above for details.



Re: [I18n]Call for testers: luit in XFree86 CVS

2001-11-12 Thread Tomohiro KUBOTA

Hi,

At Tue, 13 Nov 2001 15:59:14 +1100 (EST),
Jim Breen wrote:

 Where it says:
 
  JISX 0213 
Japanese national standard. Released recently. Intended to be used
in addition to JISX 0208. Share many characters with JISX 0212. 
  
 And the author?
 
  12 November 2001
  Tomohiro KUBOTA [EMAIL PROTECTED] 

Oh, sorry!  That is a mistake.
(The last modification, on 12 November 2001, was related to the
change of the Unicode charts site.)





Re: [I18n]Call for testers: luit in XFree86 CVS

2001-11-12 Thread Tomohiro KUBOTA

Hi,

At 12 Nov 2001 18:56:10 +,
Juliusz Chroboczek wrote:

 I don't want to extend luit for 4.2.0; bug fixes only in this version.
 Much of what you're proposing will go into future releases of luit.
 More precisely,

I see.  Let's discuss these points after the release of 4.2.0.

BTW, I now have trouble compiling luit.  charset.c includes
X11/fonts/fontenc.h and I could not find it.  I found it at
xc/lib/font/include/fontenc.h .  Is it the right file?

When I proceeded with the compilation, I got errors like:

charset.o: In function `FontencCharsetRecode':
charset.o(.text+0x146): undefined reference to `FontEncRecode'
charset.o: In function `getFontencCharset':
charset.o(.text+0x2f0): undefined reference to `FontEncMapFind'
charset.o(.text+0x302): undefined reference to `FontMapReverse'

I think I need some libraries from the XFree86 CVS tree...


 TK How about Johab?
 
 Don't know.  We'll see.

Johab is a Korean encoding which covers the full set of hangul
syllables as well as the symbols and ideographs in KS X 1001.
However, its codepoints are not compatible with EUC-KR.

 As I've already mentioned, I strongly dislike the complexity of
 Markus' proposal.  I want to use single shifts only.

That is why I suggested CSI 1 w for each character.

 TK but I am afraid this solution can be too heavy, because luit will
 TK have to issue CSI 1 w for each doublewidth character and XTerm
 TK will have to parse it.
 
 I don't think that will be much of a problem.  If it is, we'll see
 what can be done.

Sure.  If we use single shifts only, we can have a simpler
sequence.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: UTF8 Terminal Detection

2001-11-12 Thread Tomohiro KUBOTA

Hi,

At Mon, 12 Nov 2001 23:24:20 +0100 (CET),
Markus Kuhn wrote:

 I don't think, this is feasible or useful. Environment variables can only
 be set by a parent process for its children. In the case of a pty terminal
 emulator that starts applications as child processes (e.g., xterm), we
 have already the locale variables providing the encoding information to
 both the terminal emulator (e.g., xterm) and its children (shell,
 applications). In other connections, terminal and applications are just
 connected by some byte-serial communications channel that doesn't transmit
 environment variables. Modifying all communications channels to do that is
 far more effort than just using UTF-8 everywhere, so why bother?

I have been using a ~/.bashrc including the following lines for a long time.

if [ "$TERM" = linux -o "${TERM%-*}" = xterm ]
then
  LANG=C
else
  LANG=ja_JP.eucJP
fi

This works for terminals which I usually use
 - terminals without Japanese (EUC-JP) support
   Linux console, Linux framebuffer console, and xterm
 - terminals with Japanese support
   kon console, jfbterm console, rxvt compiled with Kanji support,
   kterm, Tera Term Pro, and shell mode in emacs on X11

For terminals which support Japanese, I'd like to set LANG=ja_JP.eucJP
so that I can use Japanese.  However, using LANG=ja_JP.eucJP in other
terminals will cause mojibake.  For example,
   http://www.debian.or.jp/~kubota/mojibake/xterm.png
Such mojibake can be avoided by setting LANG=C (English messages will
be displayed, which I can read with the help of an English-Japanese
dictionary).

Because it is a real bother to set LANG or to invoke screen manually
each time I start a new terminal, I am now almost happy with the above
setting.

However, the TERM-checking approach does not work in every case, nor
is it the right way.  It also does not work for languages other than
Japanese: TERM=kterm is available and is widely used for
Japanese-capable terminals, while there is no equivalent for Korean,
Chinese, Thai, or other languages.  For example, Hanterm sets
TERM=xterm.

You may wonder why I have to use terminals without Japanese support
at all; setting LANG=ja_JP.eucJP and using only Japanese-capable
terminals would make me happy.  However, everyone has occasion to use
the Linux (or BSD, ...) console, and many programs invoke xterm
directly.

Anyway, using the TERM variable for this purpose is not reliable,
though this has been a real daily need for us for many years.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Locales and Emacs 21

2001-10-23 Thread Tomohiro KUBOTA

Hi,

At Tue, 23 Oct 2001 12:25:00 +0100 (BST),
Markus Kuhn wrote:

 Unfortunately, that doesn't work right out-of-the-box yet.
 
 Elisp has at the moment no direct way of accessing the output of
 nl_langinfo(CODESET), therefore Emacs doesn't know about the current
 locale's character set and can't consider this information when deciding
 on the character set of a loaded file. Gerd Moellmann [EMAIL PROTECTED] said
 that fixing this would already be on the post-21 todo list.

Emacs 20 already had a mechanism to guess the encoding of a file
from an ordered candidate list of encodings.  The problem is that we
have to configure the list ourselves.  (set-language-environment sets
this list.)

For example, in a Japanese environment, the encoding guesser will
check the encoding of the file against the candidates EUC-JP,
Shift_JIS, and ISO-2022-JP.  (UTF-8 should be added to this list.)

Thus, what is configured via the LC_CTYPE variable should be the top
candidate for the guesser, not the only candidate.


I think terminal-coding-system should also be set from LC_CTYPE.
I heard a few months ago that Emacs 21 would be able to do this.
Now Emacs 21 has been released.  Has anyone tested it?

Also, Emacs 20 in the X Window System could not display non-ISO-8859-1
characters without some settings in ~/.emacs or ~/.Xresources
(these characters were displayed as white boxes).  This is caused
by an improper default font configuration.  Is this problem fixed
in Emacs 21?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Vim 6.0 has been released (debian info)

2001-10-04 Thread Tomohiro KUBOTA

Hi,

At Wed, 3 Oct 2001 23:24:11 -0400,
Jimmy Kaplowitz [EMAIL PROTECTED] wrote:

 That's not true on my up-to-date Debian system, running sid/unstable.
 The current release, 6.0.011-2 (which corresponds to upstream vim
 6.0.11), is compiled with multi_byte disabled. The alpha and beta
 packages had it enabled, and I hereby put in my vote for it to be
 re-enabled. Wichert, a number of us think UTF-8 support is essential to
 the system of the future. If you want a minimalist version of vim
 without UTF-8, reintroduce vim-tiny.

I confirmed that I was wrong and you are right.  This is a terrible
situation.  Now multibyte-language speakers cannot use vim at all,
neither in a legacy encoding nor in UTF-8.  Even my bug report with a
patch (#107856) did not fix this situation, though Wichert closed the
bug when he packaged Vim 6.0!

 Bug#107856:  http://bugs.debian.org/107856

In short, Vim 6.0 without locale support is completely useless for
CJK people, while for 8bit-language people it merely means they
cannot use UTF-8 mode; they can still use legacy encodings.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Vim 6.0 has been released (debian info)

2001-10-04 Thread Tomohiro KUBOTA

Hi,

At Thu, 4 Oct 2001 06:34:50 -0400 (EDT),
Thomas E. Dickey [EMAIL PROTECTED] wrote:

 want isn't the same as need

Right, 8-bit-language people want UTF-8 support.  However, CJK
people need either EUC support or UTF-8 support.  (Of course
we want both.)

Fortunately, since Vim 6.0 supports the LC_CTYPE locale, it supports
both EUC and UTF-8.  On the other hand, Vim 6.0 built without UTF-8
support lacks locale support as well, which makes Vim 6.0 without
locale (including UTF-8) support completely useless for CJK people.

I imagine that RTL-language speakers and speakers of languages with
combining characters also cannot live with Vim 6.0 without locale
support.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Unicode support under Linux

2001-10-03 Thread Tomohiro KUBOTA

Hi,

At Wed, 03 Oct 2001 15:45:31 -0400,
Richard, Francois M [EMAIL PROTECTED] wrote:

 But, is it also true to say that under Linux utf-8 Locales, all C functions
 handle properly char data representing utf-8 character encoded data? Do
 strlen, strchr, strcmp, strcpy, toupper process char data correctly when the
 Locale character encoding is utf-8? OR I need to use the wide character
 functions after specific conversion from char to wchar_t of my charatcer
 data?

Not perfectly.

* strlen
  strlen counts the *number of bytes* of the given string, not the
  *number of characters*.  Since UTF-8 is a multibyte encoding, the
  two do not coincide.

* strcpy
  works well.

* strchr
  cannot search for a non-ASCII character at all, because a multibyte
  UTF-8 character cannot be expressed in a single 'char'.

I think the simplest way to substitute for all these functions is to
use wide characters.  The standard C library has wchar_t equivalents
of the above functions, and there are conversion functions between
multibyte characters and wide characters.  Note that "multibyte
character" does not mean the character always occupies multiple
bytes; it refers to the locale-dependent encoding.  This means that
in an ISO-8859-1 locale the multibyte encoding is ISO-8859-1, and in
a Big5 locale it is Big5.  I.e., if you write your software using
multibyte characters and wide characters, your software will support
not only UTF-8 but also all major encodings in the world, such as
ISO-8859-*, EUC-*, KOI8-*, and so on.

An explanation of the wchar_t functions is available in my document,
linked from the signature at the bottom of this mail.

Note that wchar_t is not always UTF-32, though this is always true in
GNU libc.  If you have to write portable software, you must not assume
that wchar_t is UTF-32.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



EastAsianWidth revised

2001-09-08 Thread Tomohiro KUBOTA

Hi,

As you know, Unicode 3.1.1 has been released.  It revised the East
Asian Width of 15 characters.

Markus, could you please update your wcwidth() implementation?
And all software which adopts Markus' wcwidth() or a private
wcwidth() should be updated.

See http://www.unicode.org/ for details.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Cross Mapping Tables (Re: EastAsianWidth revised)

2001-09-08 Thread Tomohiro KUBOTA

Hi,

At Sat, 8 Sep 2001 20:54:37 +0100 (BST),
Markus Kuhn [EMAIL PROTECTED] wrote:

 The following 15 characters went from neutral to ambiguous,
 probably someone discovered them in some CJK character set
 that displays them double-width:

I imagine so, though these characters are not related to
my report http://www.debian.or.jp/~kubota/unicode-symbols.html .

However, there is another problem: the Unicode Consortium has
abolished all East Asian cross-mapping tables.  I once pointed
out that there are many cross-mapping tables between Japanese
Shift_JIS / JIS X 0208 and Unicode.  I said that this causes
a problem: an identical document in JIS X 0208 can become
different when converted into Unicode in different environments.

Now we have lost these mapping tables.  Thus, the situation
I pointed out has become even worse, because anyone can implement
an arbitrary mapping table now that there is no standard.
I will request that the Unicode Consortium supply one authorized,
reliable reference mapping table between Unicode and JIS X 0208.

This problem also affects EastAsianWidth.  We have now lost
a way to discuss which Unicode characters are double-width in
East Asia, except for characters used only in CJK (such as
Han ideographs, Hiragana, Katakana, Hangul, and CJK-only
punctuation).


 The normal wcwidth() did not change as a result of Unicode 3.1.1,
 because  both neutral and ambiguous characters result there in
 the same width: 1
 
 I just updated the still somewhat experimental wcwidth_cjk(),
 in case people found that so far actually useful. It contains
 a new table of EastAsianWidth Ambiguous characters.
 
 http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Thanks.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: How to use Unicode

2001-08-31 Thread Tomohiro KUBOTA

Hi,

At Fri, 31 Aug 2001 23:15:20 +0100 (BST),
Markus Kuhn [EMAIL PROTECTED] wrote:

 The -u8 was a temporary hack needed 2 years ago before glibc 2.2 with
 UTF-8 locale support was around. It is obsolete now, except on other
 operating systems (namely: FreeBSD) that still didn't have UTF-8 locales
 last time I checked.  If you set the locale, then not only xterm but also
 all processes started inside will be informed that you want UTF-8. That's
 much neater as it replaces zillions of command line options to activate a
 separate UTF-8 mode for each single tool.

True.  Once a user sets the LANG variable, he/she should not need any
further specification of language and encoding.  This is (part of)
the idea of locale.

XTerm has started to support locales partially: only UTF-8 locales
so far.  Further improvements will be discussed.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Backspace problem in Xterm/rxvt

2001-08-13 Thread Tomohiro KUBOTA

Hi,

At Tue, 14 Aug 2001 10:13:56 +1000 (EST),
Jim Breen [EMAIL PROTECTED] wrote:

 kterm's long-term practice notwithstanding, a BS should backspace over a
 whole character, and not fragment of one.

If by "BS" you mean the BS key on your keyboard, I agree.

If you mean the BS code (0x08) output to the tty by software, I don't
agree.  Such a change to a de-facto standard is simply impossible.
This is not a discussion about which is technically better.

It is not only kterm's practice but the practice of every Japanese
terminal and every Japanese-enabled program.  I remember you are
living in Japan now, aren't you?  Then you can try the Japanese
version of MS-DOS, Tera Term, the telnet included in MS-Windows, NCSA
telnet, rxvt, eterm, aterm, wterm, and so on.  I think you cannot
find any column-oriented terminal which moves the cursor two columns
for one output of the 0x08 code.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Vim 6.0 now in beta test

2001-08-06 Thread Tomohiro KUBOTA

Hi,

At Mon, 06 Aug 2001 13:08:55 +0200,
Bram Moolenaar [EMAIL PROTECTED] wrote:

 I don't see this problem.  Are you using Vim in the GUI version or in a
 terminal?  Does the cursor move to the right position after a delay or when
 typing another character like f?

I am using terminal version on xterm 157 with -u8.


I input a double-column character using the 'a' command and so on.

[]   - a doublecolumn character
  ~  - cursor position

Then I hit ESC key

[{}  - {} means dotted box character and [ means garbage half.
 ~

I found that the following occurs after about one second:

[]}  - } means garbage half of dotted box character. [] is right character.
~

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Vim 6.0 now in beta test

2001-08-05 Thread Tomohiro KUBOTA

Hi,

At Sun, 05 Aug 2001 12:42:03 +0100,
Markus Kuhn [EMAIL PROTECTED] wrote:

 Vim 6.0, Bram Moolenaar's vi editor with full UTF-8 support has now
 moved from alpha to beta test stage, so it's supposed to be stable and
 just needs wide and thorough testing now before it gets burned on
 millions of CD-ROMs:

I tried it and I found a bug.  In a UTF-8 locale, when I input a
double-width character (for example, hiragana) at the end of a line
and hit the ESC key, the cursor moves left by only one column.  It
should be two columns.

This bug does not occur in the EUC-JP locale.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Backspace problem in Xterm/rxvt

2001-08-05 Thread Tomohiro KUBOTA

Hi,

At Mon, 06 Aug 2001 09:58:28 +0500 (IST),
[EMAIL PROTECTED] wrote:

 While using Backspace or Delete to erase a character in Xterm with
 UTF-8 support, it does not work properly.  It takes more than one
 Backspace to erase a single character.

This is not the responsibility of terminals but of shells.
Terminals are responsible for erasing one column per 0x08
code.  It is the shell's responsibility to issue the proper number
of 0x08 codes for one press of the BS key and to erase the proper
number of bytes from its internal buffer.

Many shells are designed on the assumption that the numbers of
characters, bytes, and columns are identical.  This assumption
is only true for encodings without multibyte characters,
double-width characters, combining characters, and other
complex features.

Try patches for bash and so on which are available at:
http://oss.software.ibm.com/developer/opensource/linux/patches/i18n.php

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: New Unifont release

2001-07-11 Thread Tomohiro KUBOTA

Hi,

At Wed, 11 Jul 2001 14:18:31 +0200 (CEST),
Bruno Haible [EMAIL PROTECTED] wrote:

  Can't b) be solved with the help of fontsets instead of redundantly
  doubling the number of fonts?
 
 Not in the current state of affairs. Xlib doesn't do anything
 meaningful when an XFontSet has two fonts with the same encoding
 (here: ISO10646-1). The fontset only helps when all you have are fonts
 in different character sets (ISO8859-x, JISX0208, JISX0212, etc.);
 then the DrawString algorithm will cut the string into segments, based
 on the character sets. Other information from the fonts (e.g. width)
 is not used during this segmentization.

Is there any possibility of a future extension of X11R5 XFontSet or
X11R6 XOM to support it?  Internationalized software which uses
XFontSet or XOM should also run under UTF-8 locales...

I think both ways (Unifont and separate fonts) should work, because
both kinds of fonts exist.  Practically, the separate-fonts way is
important because there are fewer fonts which include large
sub-charactersets such as ideographs.


 And for new code, we use Xft instead of XFontSet. There also, it is
 helpful to have the entire Unicode repertoire in a single font.

IMO, introducing another scheme as the recommended default way is not
a good idea.  The more new knowledge is needed, the less software
will be internationalized.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Arabic (was Re: [I18n]Syriac)

2001-07-05 Thread Tomohiro KUBOTA

Hi,

At Fri, 6 Jul 2001 04:30:04 +0430 (IRDT),
Roozbeh Pournader [EMAIL PROTECTED] wrote:

 We have to choose some way: go the OpenType way, or come to some
 assignment of glyph numbers somewhere (Private use area? After U+10?)
 for the missing presentation forms.

Why not submit a proposal to include them to Unicode?

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Luit and screen [was: anti-luit]

2001-07-04 Thread Tomohiro KUBOTA

Hi,

At Wed, 4 Jul 2001 20:39:30 +0100 (BST),
Robert de Bath robert$@mayday.cix.co.uk wrote:

 Oops, I just went back to the GNU site; wrong licence.
 The _X11_ licence is compatible with the GPL ...
 so what's the problem Juliusz? You won't be using GPL code from outside
 in luit so there's no 'infection'.

The X11 license is compatible with the GPL.  This means X11-licensed
software can be used as a basis for GPL-ed software.  However, the
copyright of GNU Project software has to be assigned to the FSF.
(Note the difference between merely GPL-ed software and GNU Project
software.)  This is the FSF's way of guarding itself legally.  A dual
license will not help this situation.

OTOH, GPL-ed software cannot be included in the XFree86 source tree,
as Juliusz said.

Thus, I think Juliusz's way (luit under the X11 license) is reasonable.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Locking Linux console to UTF-8

2001-07-03 Thread Tomohiro KUBOTA

Hi,

At Sat, 30 Jun 2001 09:05:15 +0100 (BST),
Markus Kuhn [EMAIL PROTECTED] wrote:

 Do HAN, HAN2, KON, etc. already all work in UTF-8 locales?

No.  I have also never heard of any development effort toward it.
Nobody seems to feel the need or to be interested in developing it
so far, at least for kon and jfbterm.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Emacs and nl_langinfo(CODESET)

2001-06-30 Thread Tomohiro KUBOTA

Hi,

At Sat, 30 Jun 2001 09:00:50 +0100 (BST),
Markus Kuhn [EMAIL PROTECTED] wrote:

 If
 you press ^C in an application that spits out BIG5 in an unfortunate
 moment or truncate a string by counting bytes, then you will loose BIG5
 synchronization, and the terminal has to skip characters in the input
 stream until is finds two G0 characters in a row to be sure again where
 the next character starts. BIG5 is an example of a rather messy encoding,
 not only in that respect. ISO 2022 is far worse.

I don't understand why the current implementation of luit
can avoid this problem while the iconv() approach cannot.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Emacs and nl_langinfo(CODESET)

2001-06-30 Thread Tomohiro KUBOTA

Hi,

At Sat, 30 Jun 2001 08:48:15 +0100 (BST),
Markus Kuhn [EMAIL PROTECTED] wrote:

 I added to xterm and less long ago code that searches for the substring
 UTF-8 in LC_ALL || LC_CTYPE || LANG, long before glibc had any UTF-8
 locale and I knew about either nl_langinfo() or even libcharset. It is now
 obvious that nl_langinfo or libcharset is the proper solution to find out
 whether we should activate UTF-8 mode or not. My only agenda here is that
 I want to get rid of the necessity to remember application-specific
 command line switches such as -u8. I consider the -u8 deprecated and would
 appreciate if people wouldn't mention it any more.

Yes.  I strongly agree that we should not introduce application-specific
command line switches such as -u8.  (In Japan, there are some books which
describe how to configure such software.  For example, you need an
"*international: yes" line in your ~/.Xresources to use xterm with
Japanese.  You need kterm instead of xterm.  Use jless instead of less.
Some internationalized X programs have a "multibyte" option to enable it.
Be careful not to specify an -*-helvetica-* font for Japanese.
I also bought a few books to establish a Japanese environment when I
started to use Linux.  That is a mess!  Setting LANG alone should be
enough.  Who needs a book simply to set the LANG variable?)

Using nl_langinfo() and libcharset _only_ to detect UTF-8 locales
is, I think, a waste.  They can also be used to detect other
encodings, including ISO-8859-*, EUC-*, KOI8-*, and so on.  Such
information can be used to enable the encoding by calling iconv() or
by invoking luit from XTerm.


 Please don't try to read my mind remotely. Please use the continuously
 updated core dump of my mind at
 
   http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
 instead. :-)

I also read your intentions from your mails to the mailing lists.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: Locking Linux console to UTF-8

2001-06-29 Thread Tomohiro KUBOTA

Hi,

At Fri, 29 Jun 2001 15:58:00 +0200 (CEST),
Bruno Haible [EMAIL PROTECTED] wrote:

  Personally I would suggest making this kind of user-space console
  software the default
 
 These consoles rely on the framebuffer console.

Though jfbterm relies on the framebuffer (and requires Linux 2.2
or later), kon does not (and works with older Linux kernels).
[According to kon's changelog file, the first test release was
on 1992-10-13, obviously before the framebuffer was available.]

However, I don't know whether Unicode can be implemented
without the framebuffer.  Just for information.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-27 Thread Tomohiro KUBOTA

Hi,

At Wed, 27 Jun 2001 20:51:31 +0200 (CEST),
Bruno Haible [EMAIL PROTECTED] wrote:

 I agree that in _some_ places programs exchange text in locale
(snip all followings)

This is just what I'd like to insist on.

Just one addition.

Since Juliusz's "filenames in UTF-8 without conversion" approach works
only under UTF-8 locales, it is a subset of the "filenames in locale
encoding" approach (i.e., the present state).  (Note that if you follow
the "filenames in locale encoding" approach, you will use UTF-8
filenames in UTF-8 locales.)  Thus, this approach does not bring any
technical improvement; it is just pressure on people who don't use
UTF-8 locales.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-26 Thread Tomohiro KUBOTA

Hi,

At Tue, 26 Jun 2001 22:11:06 +0200 (CEST),
Bruno Haible [EMAIL PROTECTED] wrote:

  - Newbies should have only a single variable to set in their
$HOME/.profile, not dozens.

Yes.  This is the point.  When users set the LANG variable, they
expect all software to obey the variable.


  - We want to make it easy for everyone to use an UTF-8 locale.
Users shouldn't be bothered to change various $HOME/.* files,
set .Xdefault resources etc.

Yes.  However, not only UTF-8 but also all other encodings.


  - All X programs which set their default font to *-iso8859-1
independently of the locale. This includes nedit.

Of course such software is buggy.  However, software
which uses XDraw{Image}String() is also buggy.  (Software
before X11R4 should use both XDraw{Image}String() and
XDraw{Image}String16().  Modern software, after X11R5,
should use X{mb,wc,(utf8?)}Draw{Image}String().)

Moreover, a default font of -adobe-helvetica-* is buggy enough:
it excludes most non-Latin fonts.  -adobe-helvetica-*,* is
good.  Or, a mechanism that appends ",*" before XCreateFontSet()
is even better, as I did when I modified twm.

in xc/programs/twm/util.c

basename2 = (char *)malloc(strlen(font->name) + 3);
if (basename2) sprintf(basename2, "%s,*", font->name);
else basename2 = font->name;
if ((font->fontset = XCreateFontSet(dpy, basename2,
                                    &missing_charset_list_return,
                                    &missing_charset_count_return,
                                    &def_string_return)) == NULL) {

Of course we can implement a better font-guessing mechanism, like
the one I implemented for IceWM, Blackbox, and Sawfish.  (I didn't
use that mechanism for twm because I thought it was too heavy for
twm.)

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-26 Thread Tomohiro KUBOTA

Hi,

At 26 Jun 2001 13:49:10 -0700,
H. Peter Anvin [EMAIL PROTECTED] wrote:

 Incidentally, I believe there needs to be an easy way to set the
 default character set in use on a system.  This may of course be
 overridden by the user (possibly at their own peril), but it is
 nevertheless a useful concept.

This mechanism has been implemented since X11R5: XFontSet.

Why is XFontSet not very popular?  I imagine some reasons:
 - People imagine from its name that it is only for CJK people
   who need multiple fonts.
 - People were accustomed to using the system without setting a
   locale, and the XFontSet-related functions assume ASCII when
   no locale is set.

Thus, when using XFontSet, I check the locale and fall back to the
conventional non-internationalized XFontStruct-related functions
when the check fails.  This avoids complaints from people who don't
know how to set the locale.  See the source code of the twm patch I
wrote for details.

xc/programs/twm/twm.c

loc = setlocale(LC_ALL, "");
if (!loc || !strcmp(loc, "C") || !strcmp(loc, "POSIX") ||
    !XSupportsLocale()) {
    use_fontset = False;
} else {
    use_fontset = True;
}

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/



Re: file name encoding

2001-06-26 Thread Tomohiro KUBOTA

Hi,

At 26 Jun 2001 16:37:05 -0700,
H. Peter Anvin [EMAIL PROTECTED] wrote:

 The issue is, however, what that does mean?  In particular, strings in
 the filesystem are usually in the system-wide encoding scheme, not
 what that particular user happens to be processing at the time.

Ah, I understand.  We were discussing different topics.
My point is not about the byte sequence of filenames in the filesystem.
It may or may not be UTF-8.  I don't care much, because users have
little chance to access the raw byte sequence on the filesystem.
My point is that user-level commands must obey the locale when they
communicate with users.  For example, 'ls' must display file names
in the locale encoding.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/


