In-Reply-To: <[EMAIL PROTECTED]>

On Wed, 31 Dec 2003 13:05:36 -0700 Neal Richter <[EMAIL PROTECTED]> sat down at his lightning-dragon-machine and wrote:

> ----- Japanese --------
> three different syllabic forms (Katakana, Hiragana, Kanji)

Nit-pick: Kanji isn't (only) syllabic, it's (also) ideographic.

...

> And there is even furigana, which is a parallel text using Kanji and
> another form in parallel.

Aaaargh! I was hoping for contradiction, and I thank you for it :)

> There are examples of groups of characters where context
> determines which characters are grouped into words, and grouping with a
> first-come dictionary approach will mess it up.

Soo... policy decision required.

Does the team want to (try to) recruit developers who can work HtDig up into a full-blown Unicode system, dealing with non-word-breaking scripts properly? Here's someone to contact in this case:

CJKV-English Dictionary
http://www.acmuller.net/dealt/

Or is there a path that deals with all the issues for languages that do break words, and fails (fairly) gracefully on those that don't?

Passing rapidly over my earlier stupid idea, the question in the latter case seems to be: when HtDig finds Katakana, Hiragana, Kanji or Hanja, what does it put in the index?

What if it treated each Kanji and Hanja ideogram as a "word" - whether or not it is - and relied on users employing phrase searching? Yes, that chews up disk space. Doing the same for Katakana and Hiragana would do so even more.

Question: is there any difference in developer effort between deploying a dictionary lookup for these alone, and doing so for all CJK?

Third possibility: Yet Another Administrator Option, the default being not indexing Katakana, Hiragana, Kanji or Hanja at all and the option being, er, something like "ideogram=word". Ideally, trying to plan ahead and writing this as a stub CJK handler, invoked in all the places where it'd be needed if we ever got around to a real one.

I just tried to see what www.nhk.or.jp (the public broadcaster) does with Roman keywords - and the depressing answer is that what appear to be news index pages include "noindex, nofollow"...

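
To make the "ideogram=word" idea a bit more concrete, here is a minimal sketch of what such a stub CJK handler might do. It is written in Python for brevity, not in HtDig's own C++, and everything named in it (`is_cjk`, `tokenize`, `phrase_match`, the two mode strings) is hypothetical: none of it exists in HtDig. The code point ranges are a rough cut covering Hiragana, Katakana and the main Han blocks only, so treat it as an illustration of the policy, not a proposal for the real range tables.

```python
# Hypothetical option values; HtDig has no such attribute today.
CJK_SKIP = "skip"                    # proposed default: do not index CJK at all
CJK_IDEOGRAM_WORD = "ideogram=word"  # proposed option: one character = one "word"


def is_cjk(ch):
    """Rough test: Hiragana, Katakana, or a common Han (Kanji/Hanja) block."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F     # Hiragana
        or 0x30A0 <= cp <= 0x30FF  # Katakana
        or 0x3400 <= cp <= 0x4DBF  # CJK Unified Ideographs Extension A
        or 0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs
    )


def tokenize(text, cjk_mode=CJK_SKIP):
    """Split text into index terms; CJK handling depends on cjk_mode."""
    tokens = []
    word = []  # characters of the word-breaking-script word being built

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()  # a CJK character always ends the current word
            if cjk_mode == CJK_IDEOGRAM_WORD:
                tokens.append(ch)  # index the single character as a term
            # in CJK_SKIP mode the character is dropped: fail gracefully
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            flush()  # whitespace or punctuation separates words
    flush()
    return tokens


def phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a consecutive run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = tokenize("HtDig 日本語 search", CJK_IDEOGRAM_WORD)
query = tokenize("日本", CJK_IDEOGRAM_WORD)
print(doc, phrase_match(doc, query))
```

The same tokenizer has to run over the query, so a user typing a multi-character term effectively gets a phrase search over single-character terms. The disk-space worry shows up directly: every indexed character becomes its own term with its own positions.
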
And mozilla.org seems to have been planning "reimplementing the CJK detector" for three years or more without doing anything - can't even be sure who owns that bug these days.

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
