> >   There is also the largely undefined issue of Asian word-breaking.  Many
> > Asian languages do not use spaces to 'break' words in text, which makes
> > it very difficult to index by word.
>
> So we need someone who reads Japanese to look at, say,
> google.jp and yahoo.jp to see how they handle the
> issue? I'm taking Japanese as the hardest example,
> what with ideograms and two syllabic scripts all
> mushed up together...
>
> Musing, hoping for contradiction:
>
> 1) is it sufficient for the class to hold that all
> characters in ideogram ranges are words?

  No.

  Example in Japanese Kanji

        computer = lightning-dragon-machine (3 kanji characters).
        electricity = lightning-dragon (2 kanji characters)

  The next question might be: Couldn't we use a dictionary to break up
words?  Nope.

  There are examples of groups of characters where context
determines which characters are grouped into words, and grouping with a
first-come dictionary approach will mess it up.
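
  To make the first-come problem concrete, here is a minimal sketch of a
forward longest-match segmenter in Python.  The four-entry dictionary and
the function name are made up for illustration (this is not ht://Dig code),
and the sample is a commonly cited Chinese string where taking the longest
word first gives the wrong grouping.

```python
# Toy forward longest-match ("first-come") segmenter.
# DICTIONARY is a made-up four-entry word list, not a real lexicon.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)


def greedy_segment(text):
    """Split text by always taking the longest dictionary word at the cursor."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # Unknown single characters fall through as one-character tokens.
            if candidate in DICTIONARY or length == 1:
                words.append(candidate)
                pos += length
                break
    return words


# Intended reading: 研究 / 生命 / 起源 ("research" / "life" / "origin").
# Greedy matching grabs 研究生 ("graduate student") first and strands 命.
print(greedy_segment("研究生命起源"))
```

  Picking the right split here needs the context of the whole phrase, which
a plain dictionary lookup does not have.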

  Same issue in Chinese & Korean.

  RightNow has native Japanese, Chinese & Korean speakers and I've quizzed
them.  This is what I gleaned:


----- Japanese --------

  There are different forms of written Japanese: a Latin-character
version (Romaji), two syllabic scripts (Katakana, Hiragana), and the
ideographic Kanji.

  And there is even furigana, which is small kana printed alongside Kanji
to show the pronunciation.

  Written Japanese on the web is a jumbled mix of all 4 forms, and
text-input systems are VERY complicated in comparison to our
type-it & see-it way.

  We can do Romaji now.  But we will need specialized word-breakers for
each written form in Japanese.

  http://members.aol.com/writejapan/
  http://www.advancedlanguage.com/japanese.htm

------ Korean -------

  Hangul (an alphabet written in syllable blocks) & Hanja (Chinese
characters).  There is also a romanized version.

  Note that Hangul uses spaces between words!!! It is also by far the most
common modern form on the web.

  http://www.nationmaster.com/encyclopedia/Korean-language

----- Chinese --------

  There are (at least) two forms of written Chinese (Hong Kong &
Taiwanese) using the full (traditional) Kanji character set, and one
common simplified version using reduced forms of many characters.

  Written Chinese is mostly shared between Cantonese and Mandarin... but
there are some differences.  And web pages and writing from Hong Kong have
many English words (an effect of having been a British colony).

  There are no spaces, and multiple Kanji symbols can form a single word.
Television = electric screen (2 characters)

--------

  Also note that the same Kanji character can mean something different in
Japanese/Chinese/Korean...  the Koreans & Japanese adopted the characters,
but usage has since diverged somewhat between the written forms.

-------

  There are a few companies that specialize in the CJK software problem.
RightNow uses Basis Tech, which provides software to help us parse the
languages... for a large fee.

  Japanese is the most complex case, both because of the multiple forms and
because mixing the forms is common on the web.

--------

  So converting to UTF-8 is like adopting the same pencil & paper... it
doesn't mean we can now parse the languages.... we'll need specialized
software for each written form of each language.
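
  A tiny Python check of the pencil & paper point, using a six-character
Chinese sample string of my own choosing: decoding UTF-8 gives back the
characters, but a whitespace split still sees one unbroken token.

```python
raw = "研究生命起源".encode("utf-8")   # bytes as they might arrive from a page
text = raw.decode("utf-8")            # decoding recovers the characters...
tokens = text.split()                 # ...but whitespace splitting finds no breaks

print(len(raw), "bytes,", len(text), "characters,", len(tokens), "token")
```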

  It just gets deeper and deeper doesn't it???????

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485





_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
