Re: xterm Unicode word-selection character classes

Markus Kuhn Thu, 03 Aug 2000 06:00:32 -0700
Bram Moolenaar wrote on 2000-08-03 12:10 UTC:
> Takuhiro Nishioka <[EMAIL PROTECTED]>:
> > I've tested xterm with utf-8 patch.  It seems almost OK,
> > but not all scripts are classified well.  For example,
> > here is a WORD:
> > 
> >     XXXYYY
> > 
> > where "XXX" is a sequence of Hiragana characters and "YYY"
> > is a sequence of Katakana characters.  But left-mouse
> > double clicking selects "XXXYYY".

That one is quite easy to fix:

Add the following to the routine init_classtab():

@@ -104,6 +104,11 @@
   SetCharacterClassRange(0x2080, 0x208f, 0x2080); /* subscript */
   SetCharacterClassRange(0x3000, 0x3000, 32); /* ideographic space */
   SetCharacterClassRange(0x3001, 0x3020, -1); /* ideographic punctuation */
+  SetCharacterClassRange(0x3040, 0x309f, 0x3040); /* Hiragana */
+  SetCharacterClassRange(0x30a0, 0x30ff, 0x30a0); /* Katakana */
+  SetCharacterClassRange(0x3300, 0x9fff, -1); /* CJK Ideographs */
+  SetCharacterClassRange(0xac00, 0xd7a3, 0xac00); /* Hangul Syllables */
+  SetCharacterClassRange(0xf900, 0xfaff, -1); /* CJK Ideographs */
   SetCharacterClassRange(0xfe30, 0xfe6b, -1); /* punctuation forms */
   SetCharacterClassRange(0xff00, 0xff0f, -1); /* half/fullwidth ASCII */
   SetCharacterClassRange(0xff1a, 0xff20, -1); /* half/fullwidth ASCII */

Are there other scripts, besides the ones treated above, in which words
are not separated by spaces or punctuation?

Given the limitations of the mechanism, I guess it is best to treat each
Kanji character as a word on its own.
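
To illustrate the effect, here is a rough, self-contained sketch of how
such a class table could drive double-click selection. It is not the
actual xterm code: the names sketch_tab, sketch_class() and
sketch_select_word() are made up for this example, and it assumes that
a class value of -1 means "each character forms its own class", as the
use of -1 for the CJK ideographs in the patch suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the class table. */
struct class_range {
    unsigned first, last;   /* inclusive code point range */
    int cclass;             /* -1: each character is its own class (assumed) */
};

static const struct class_range sketch_tab[] = {
    { 0x3000, 0x3000, 32 },      /* ideographic space */
    { 0x3040, 0x309f, 0x3040 },  /* Hiragana */
    { 0x30a0, 0x30ff, 0x30a0 },  /* Katakana */
    { 0x3300, 0x9fff, -1 },      /* CJK Ideographs */
};

/* Return the class of code point c. */
static int sketch_class(unsigned c)
{
    size_t i;

    for (i = 0; i < sizeof sketch_tab / sizeof sketch_tab[0]; i++) {
        if (c >= sketch_tab[i].first && c <= sketch_tab[i].last)
            return sketch_tab[i].cclass == -1 ? (int) c
                                              : sketch_tab[i].cclass;
    }
    /* Assumption of this sketch: unlisted characters are their own class. */
    return (int) c;
}

/* Compute the half-open range [*start, *end) of the run of characters
 * around position pos that share the class of line[pos]. */
static void sketch_select_word(const unsigned *line, size_t len, size_t pos,
                               size_t *start, size_t *end)
{
    int cls;
    size_t s = pos, e = pos + 1;

    assert(pos < len);
    cls = sketch_class(line[pos]);
    while (s > 0 && sketch_class(line[s - 1]) == cls)
        s--;
    while (e < len && sketch_class(line[e]) == cls)
        e++;
    *start = s;
    *end = e;
}
```

With a line made of two Hiragana, two Katakana and two different Kanji,
a double click in this sketch selects the Hiragana run, the Katakana
run, or a single Kanji, depending on where it lands. Two adjacent
identical Kanji would still be selected together, which is one of the
limitations mentioned above.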

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/