Hi internationals, nationals and locals:

There are 4 separate issues:
- languages (English, Mandarin, Cantonese)
- fonts for presentation (armani's *.ttf files)
- character encodings (e.g. EUC, Big-5...)
- algorithms of htdig

Now, ignoring all presentation issues (fonts, output HTML tags, etc.),
the language issues (fuzzy search, bad-word lists),
and browser issues (how to interpret a keyword that the
browser sent), what is left is the character encoding.

A correct HTML page declares its character encoding, so
htdig on the receiving end can convert it to any encoding it likes.
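
To make the conversion step concrete, here is a minimal sketch in Python (not htdig code). It assumes the encoding is declared in a <meta> tag near the top of the page. The helper names declared_charset and transcode are made up for illustration, and the default target "gb2312" is Python's codec name for EUC-CN.

```python
import re

# Hypothetical helper pattern: finds "charset=..." in the raw page bytes.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def declared_charset(raw_html, default="iso-8859-1"):
    """Return the charset named in the page's <meta> tag, or a default."""
    match = _CHARSET_RE.search(raw_html[:1024])  # only look near the top
    return match.group(1).decode("ascii") if match else default


def transcode(raw_html, target="gb2312"):
    """Decode the page using its declared charset and re-encode it.

    Assumption: every character on the page exists in the target encoding;
    otherwise encode() raises UnicodeEncodeError.
    """
    text = raw_html.decode(declared_charset(raw_html))
    return text.encode(target)


# A tiny Big-5 page, converted to EUC-CN for a byte-oriented indexer.
page = (b'<meta http-equiv="Content-Type" content="text/html; charset=big5">'
        + "中文".encode("big5"))
converted = transcode(page)
```

Note that the <meta> tag in the converted bytes still names the old charset. A real converter would also have to look at the HTTP Content-Type header, which this sketch ignores.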

If htdig uses a character encoding like EUC that is context
independent and coexists with seven-bit single-byte characters,
what actually prevents htdig from doing its thing?
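
The property I am relying on can be checked with a small Python sketch (again not htdig code, and the function names are made up for illustration). In EUC every byte of a multi-byte character has the high bit set, so splitting a byte string on ASCII whitespace can never cut a character in half. Big-5 does not have this property, because its second byte may fall in the ASCII range.

```python
def high_bit_only(text, encoding):
    """True if every byte of every non-ASCII character has bit 7 set."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode(encoding)
    )


def split_words(data):
    """Byte-oriented word split on ASCII whitespace only."""
    return data.split()


# "gb2312" is Python's codec name for EUC-CN.
sample = "htdig 中文 search".encode("gb2312")
words = split_words(sample)

euc_safe = high_bit_only("中文", "gb2312")
# In Big-5 the second byte of some characters is an ASCII value
# (for 許 it is 0x5C, the backslash), so the same check fails.
big5_safe = high_bit_only("許", "big5")
```

So with EUC input, a parser that only treats seven-bit characters as separators leaves the Chinese words intact, as long as nothing else touches the high bit.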

It boils down to 2 questions (sorry, I never looked at the source code):
        - is htdig 8-bit clean?
        - are htdig's words and dictionaries sequences of bytes?
If the answer to both is yes, then I would guess the core is OK,
and we only have to look at how to use it properly.
I hope I did not overlook a parsing issue.
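
To show why the two questions matter, here is a toy Python sketch. It is purely hypothetical and says nothing about how htdig actually stores its words. The index treats words as opaque byte sequences, and strip_high_bit simulates a component that is not 8-bit clean.

```python
from collections import defaultdict


def build_index(docs):
    """Map each whitespace-delimited byte word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, data in docs.items():
        for word in data.split():
            index[word].add(doc_id)
    return index


def strip_high_bit(data):
    """Simulate a step that is NOT 8-bit clean by clearing bit 7 of each byte."""
    return bytes(b & 0x7F for b in data)


docs = {
    1: "htdig 中文".encode("gb2312"),  # EUC-CN bytes
    2: b"htdig search",
}
index = build_index(docs)

query = "中文".encode("gb2312")
hits = index.get(query)                  # found when bytes pass through untouched
damaged = strip_high_bit(query)          # what a 7-bit-only step would produce
lost = index.get(damaged)                # no match any more
```

If both answers are yes, the lookup behaves like the first case. If any step clears or misreads the high bit, the word turns into unrelated ASCII and the match is lost.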

Oskar

Wed, 22 Mar 2000 09:06:16 -0600 Geoff Hutchison
<[EMAIL PROTECTED]> said:
> At 6:48 AM +0800 3/22/00, armani wrote:
> >After I build htdig this search engines, it will work fastly on my 
> >web server except Chinese words.
> 
> The problem is that Chinese words (and many other languages) use 
> multi-byte characters. Currently, ht://Dig does not support 
> multi-byte characters, so it cannot be used to index Chinese.
--
Dr. Oskar Bartenstein                 [EMAIL PROTECTED]
IF Computer Japan                         www.ifcomputer.com


