[htdig-dev] Re: Internationalisation, UTF-8 and entities.

Mike Holderness Fri, 02 Jan 2004 12:47:57 -0800

In-Reply-To: <[EMAIL PROTECTED]>
On Wed, 31 Dec 2003 13:05:36 -0700 Neal Richter <[EMAIL PROTECTED]> sat 
down at his lightning-dragon-machine and wrote:


> ----- Japanese --------

> three different syllabic forms (Katakana, Hiragana, Kanji)

Nit-pick: Kanji isn't (only) syllabic, it's (also) ideographic.

 ... 
>   And there is even furigana, which is a parallel text using Kanji and
> another form in parallel.

Aaaargh!

I was hoping for contradiction, and I thank you for it :)

>   There are examples of groups of characters where context
> determines which characters are grouped into words, and grouping with a
> first-come dictionary approach will mess it up.

Soo... policy decision required. 

Does the team want to (try to) recruit developers who can 
work HtDig up into a full-blown Unicode system, dealing 
with non-word-breaking scripts properly? 

  Here's someone to contact in this case: 
  CJKV-English Dictionary
  http://www.acmuller.net/dealt/

Or is there a path that deals with all the issues for 
languages that do break words, and fails (fairly) 
gracefully on those that don't? 

Passing rapidly over my earlier stupid idea, the question
in the latter case seems to be: when HtDig finds 
Katakana, Hiragana, Kanji or Hanja, what does it put
in the index? 

What if it treated each Kanji and Hanja ideogram as a 
"word" - whether or not it is - and relied on users 
employing phrase searching?  
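
To make the "each ideogram is a word" idea concrete, here is a minimal sketch in Python (not HtDig's C++, and not existing HtDig code). The names `tokenize` and `phrase_match` are made up for illustration, and the code-point ranges are an assumption: they cover only the main CJK Unified Ideographs block and Extension A, which is where Kanji and Hanja mostly live.

```python
# Assumed ranges: CJK Unified Ideographs and Extension A only.
# A real handler would need to cover more blocks than these two.
IDEOGRAPH_RANGES = ((0x4E00, 0x9FFF), (0x3400, 0x4DBF))


def is_ideograph(ch):
    """True if ch falls in one of the assumed ideograph ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in IDEOGRAPH_RANGES)


def tokenize(text):
    """Yield words; each ideograph becomes its own one-character word."""
    word = []
    for ch in text:
        if is_ideograph(ch):
            # An ideograph ends any word in progress and stands alone.
            if word:
                yield "".join(word)
                word = []
            yield ch
        elif ch.isalnum():
            word.append(ch)
        else:
            # Whitespace or punctuation: ordinary word break.
            if word:
                yield "".join(word)
                word = []
    if word:
        yield "".join(word)


def phrase_match(tokens, phrase):
    """True if the phrase's tokens occur consecutively in tokens."""
    want = list(tokenize(phrase))
    n = len(want)
    return n > 0 and any(
        tokens[i:i + n] == want for i in range(len(tokens) - n + 1)
    )
```

A query for a multi-ideogram word then only works as a phrase search, because the index holds one entry per ideogram plus its position.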

Yes, that chews up disk space. 

Doing the same for Katakana and Hiragana would do so 
even more. Question: is there any difference in 
developer effort between deploying a dictionary 
lookup for these alone, and doing so for all CJK?  

Third possibility: Yet Another Administrator Option, the 
default being not indexing Katakana, Hiragana, Kanji or 
Hanja at all and the option being, er, something like 
"ideogram=word". 

Ideally we would plan ahead and write this as a 
stub CJK handler, invoked in all the places where a 
real one would be needed if we ever got around to it. 
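
A rough sketch of how the option and the stub handler might fit together, again in Python for brevity. Everything here is hypothetical: `CJKMode`, `handle_cjk_run` and `index_words` are invented names, the option value "ideogram=word" is only the placeholder suggested above, and the code-point ranges are a simplification. It also assumes the option applies to Katakana and Hiragana characters as well as ideograms.

```python
from enum import Enum


class CJKMode(Enum):
    SKIP = "skip"                    # assumed default: index no CJK text
    IDEOGRAM_WORD = "ideogram=word"  # each CJK character is a "word"


# Assumed ranges: Hiragana, Katakana, CJK Extension A, CJK Unified.
CJK_RANGES = ((0x3040, 0x309F), (0x30A0, 0x30FF),
              (0x3400, 0x4DBF), (0x4E00, 0x9FFF))


def is_cjk(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def handle_cjk_run(run, mode):
    """Stub CJK handler: the one place a real segmenter would plug in."""
    if mode is CJKMode.IDEOGRAM_WORD:
        return list(run)
    return []  # SKIP: contribute nothing to the index


def index_words(text, mode=CJKMode.SKIP):
    """Split text into index words, routing CJK runs through the stub."""
    words, latin, cjk = [], [], []

    def flush():
        # At most one of the two buffers is non-empty at any time.
        if latin:
            words.append("".join(latin))
            latin.clear()
        if cjk:
            words.extend(handle_cjk_run("".join(cjk), mode))
            cjk.clear()

    for ch in text:
        if is_cjk(ch):
            if latin:
                flush()
            cjk.append(ch)
        elif ch.isalnum():
            if cjk:
                flush()
            latin.append(ch)
        else:
            flush()
    flush()
    return words
```

The point of routing every CJK run through one function is that a dictionary-based segmenter could later replace the body of `handle_cjk_run` without touching the callers.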

I just tried to see what www.nhk.or.jp (the public 
broadcaster) does with Roman keywords - and the 
depressing answer is that what appear to be news 
index pages include "noindex, nofollow"...

And mozilla.org seems to have been planning 
"reimplementing the CJK detector" for three 
years or more without doing anything - can't 
even be sure who owns that bug these days. 



