On Fri, 4 Aug 2000, Markus Kuhn wrote:
> Takuhiro Nishioka wrote on 2000-08-03 19:56 UTC:
> > > Given the limitations of the mechanism, I guess it is best to treat each
> > > Kanji character as a word on its own.
> >
> > I don't know about the mechanism. In my humble opinion, I
> > think that it is a bit inconvenient that each Kanji
> > character is treated as a word, at least when editing
> > Japanese texts.
>
> If you prefer that each consecutive sequence of Kanji is treated
> like a single word, then replace
>
> SetCharacterClassRange(0x3300, 0x9fff, -1); /* CJK Ideographs */
> SetCharacterClassRange(0xf900, 0xfaff, -1); /* CJK Ideographs */
> by
> SetCharacterClassRange(0x3300, 0x9fff, 0x4e00); /* CJK Ideographs */
> SetCharacterClassRange(0xf900, 0xfaff, 0x4e00); /* CJK Ideographs */
>
> Is this more useful?
>
> How this works is as follows: SetCharacterClassRange(a, b, c) assigns to
> characters in the interval [a, b] the class code c. Class code -1 means
> that the number of the character is the class code. Word selection goes
> from the selected character to the left and right, until it hits a
> different class code. Usually, the class code is one representative

I think attempting to arrive at a definition for a "word", without the aid
of a "dictionary", for written languages that do not explicitly mark word
boundaries (with spaces, etc.) will be difficult. In the Japanese and
Korean* cases (but not Chinese), this is further complicated by their being
agglutinative languages (like Finnish and Turkish), where the boundaries
between "words" become fuzzier.

* Korean is written with spaces explicitly to delimit "words", I believe.

I don't think treating contiguous sequences of U+3400 .. U+9FFF and U+F900
.. U+FAFF as a "word" is a perfect default either; in the case of
Japanese, there are "words" composed of kanji + kana combinations (most
commonly verbs and adjectives, but also some nouns); in the case of
Chinese, where all text consists of such characters, this would select
entire phrases or sentences (depending on where the punctuation is). (I
don't know enough about Korean writing to comment on its situation.) The
original treatment, where each character in U+3400 .. U+9FFF and U+F900 ..
U+FAFF is treated as a "word", is better for Chinese, since the characters
can be considered atomic units of sorts (approximately on the level of
morphemes) in cases where they aren't words, but this isn't true for
Japanese text.
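
To make the trade-off concrete, here is a rough sketch of the selection rule
as I understand it from the description quoted above. It is not the actual
implementation: class_per_char(), class_per_run(), and select_word() are
names I made up, and every non-ideograph simply keeps its own code point as
its class to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*class_fn)(unsigned cp);

/* Ranges as discussed in this thread (0x3400 rather than 0x3300). */
static int is_ideograph(unsigned cp)
{
    return (cp >= 0x3400 && cp <= 0x9fff) || (cp >= 0xf900 && cp <= 0xfaff);
}

/* Hypothetical table for the "-1" case: the code point is its own class. */
static int class_per_char(unsigned cp)
{
    return (int)cp;
}

/* Hypothetical table for the "0x4e00" case: all ideographs share a class. */
static int class_per_run(unsigned cp)
{
    return is_ideograph(cp) ? 0x4e00 : (int)cp;
}

/* Grow the selection left and right from pos while the class code matches.
 * On return, text[*start .. *end] (inclusive) is the selected "word". */
static void select_word(const unsigned *text, size_t len, size_t pos,
                        class_fn classify, size_t *start, size_t *end)
{
    assert(pos < len);
    int cls = classify(text[pos]);
    size_t s = pos, e = pos;

    while (s > 0 && classify(text[s - 1]) == cls)
        s--;
    while (e + 1 < len && classify(text[e + 1]) == cls)
        e++;

    *start = s;
    *end = e;
}
```

Given the code points U+98DF U+3079 U+308B (a verb written as one kanji
followed by two kana), selecting on the kanji picks up only the kanji under
either table, never the whole verb.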

Also, why the choice of "0x3300" above? That's the start of the "CJK
Compatibility" block; it should be "0x3400".

More elegant character classes could be devised, of course, but they'd
have to be customized to each (written) language. (With regard to the
Japanese "word"-segmentation case, perhaps something can be done with a
list of particles 'ni-te-wo-ha' for a simple parser.)
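
A very naive sketch of that particle idea, only to show what I mean: the
names is_particle() and select_segment() are made up, the particle list is
just the four kana above (U+306B, U+3066, U+3092, U+306F), and the input is
assumed to be a run of Japanese text with no spaces or punctuation. It will
mis-split any word that merely contains one of these kana, so it is a
starting point at best.

```c
#include <assert.h>
#include <stddef.h>

/* The four particles ni, te, wo, ha as single hiragana code points. */
static int is_particle(unsigned cp)
{
    return cp == 0x306b || cp == 0x3066 || cp == 0x3092 || cp == 0x306f;
}

/* Select the segment around pos: a particle is selected on its own;
 * anything else grows left and right until a particle or the end of the
 * text is reached. text[*start .. *end] (inclusive) is the result. */
static void select_segment(const unsigned *text, size_t len, size_t pos,
                           size_t *start, size_t *end)
{
    assert(pos < len);
    size_t s = pos, e = pos;

    if (!is_particle(text[pos])) {
        while (s > 0 && !is_particle(text[s - 1]))
            s--;
        while (e + 1 < len && !is_particle(text[e + 1]))
            e++;
    }

    *start = s;
    *end = e;
}
```

Unlike a pure character-class table, this keeps a kanji + kana run together
as long as no listed particle falls inside it.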

Thomas Chan
[EMAIL PROTECTED]
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/