So this is promising - great.

Wed, 22 Mar 2000 20:59:06 -0600 Geoff Hutchison
<[EMAIL PROTECTED]> said:

> It is 8-bit clean, but it treats characters as synonymous with 8 
> bits. Many parts of the code (the String class in particular) assume 
> that a character is only 1 byte and keeps going. In many encodings, 
> this is *not* the case, and so you're stuck.

Yes, in general a character is not a byte. Still, I don't see,
at least for clean encodings like EUC, where this difference
should break the workings of htdig.
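
To illustrate what I mean by "clean": in EUC-JP every byte of a
non-ASCII character has the high bit set, so code that splits on ASCII
bytes never cuts a character in half. A small Python sketch (the sample
text and the function names are made up for the illustration, this is
not htdig code):

```python
# Sketch: byte-oriented word splitting on EUC-JP data.
# The sample text and helper names are illustrative assumptions.
text = "日本語 search エンジン"
raw = text.encode("euc_jp")


def split_on_ascii_space(data):
    """Split raw bytes on the ASCII space, as a byte-oriented parser would."""
    return data.split(b" ")


def high_bit_only(data):
    """True if every byte of every non-ASCII character is >= 0x80."""
    for ch in data.decode("euc_jp"):
        if ord(ch) > 0x7F and any(b < 0x80 for b in ch.encode("euc_jp")):
            return False
    return True


# Each piece still decodes cleanly, although it was split at byte level.
words = [w.decode("euc_jp") for w in split_on_ascii_space(raw)]
print(len(text), "characters,", len(raw), "bytes")
print(words)
```

This only shows that splitting on ASCII delimiters is safe. Anything
that counts or truncates characters by byte offset would still need
care.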

> >A correct HTML page includes info about its encoding, therefore
> >htdig on the receiving end can convert it to any code it likes.
> 
> Yes, provided that it has code to convert from one encoding into 
> another. :-) This is the crux of the problem.

I would use an external converter. There is good code for this, e.g.
nkf, tcs, and many others. See http://ftp.monash.edu.au/pub/nihongo/
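
Roughly what I have in mind, as a Python sketch: pipe the raw page
through an external command and use whatever comes out. The function
name run_filter is my own invention. So that the sketch runs without
any converter installed, it uses a Python one-liner as a stand-in; with
nkf one would pass something like ["nkf", "-e"] instead (check the man
page of whichever tool you pick).

```python
import subprocess
import sys


def run_filter(cmd, data):
    """Pipe raw page bytes through an external command, return its stdout."""
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Stand-in converter (an assumption for this sketch): Shift_JIS in, EUC-JP out.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; d = sys.stdin.buffer.read(); "
    "sys.stdout.buffer.write(d.decode('shift_jis').encode('euc_jp'))",
]

page = "日本語".encode("shift_jis")
converted = run_filter(STAND_IN, page)
print(converted.decode("euc_jp"))
```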

> Currently ht://Dig 
> assumes the host system has working locale support and is getting the 
> pages in the default encoding of the system. If they're not, it 
> assumes they are anyway. :-) It makes no attempt to convert character 
> encodings.

> Basically, if you have an Latin-1 encoding for your character-set, 
> you're OK. That's the limit of the current i18n.

To the best of my knowledge, one HTML page can only have one encoding,
but a web server can serve international pages in many encodings.
These have nothing to do with the encoding used on the machine
that runs the search engine.

A person who carefully serves an international audience will include
something like this example for EUC:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">
to allow a browser to display the page properly.
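
Picking the charset out of such a tag is not much work. A minimal
Python sketch, assuming the tag sits near the top of the page; the
regular expression, the Latin-1 fallback and the alias table are my own
assumptions, not anything htdig does today:

```python
import re

# Matches <META ... CONTENT="...charset=NAME"> in raw bytes, any letter case.
META_CHARSET = re.compile(
    rb"""<meta[^>]+content\s*=\s*["'][^"'>]*charset\s*=\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

# Hypothetical alias table: page labels mapped to Python codec names.
ALIASES = {"x-euc-jp": "euc_jp", "x-sjis": "shift_jis"}


def sniff_charset(head, default="iso-8859-1"):
    """Return the charset label declared in a META tag, or the default."""
    m = META_CHARSET.search(head)
    return m.group(1).decode("ascii").lower() if m else default


def codec_for(label):
    """Map a page label to a codec name, passing unknown labels through."""
    return ALIASES.get(label, label)


tag = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">'
label = sniff_charset(tag)
print(label, "->", codec_for(label))
```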

That leaves three tasks:
1 - Convince htdig to read these tags to get the encoding
    of the incoming page.
2 - Find a good place to attach an external converter to filter
    incoming pages.  
3 - Determine whether the CGI input is understood by htsearch as it is,
    or whether it also needs special attention.
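
For task 3, the question is really which charset the browser used when
it percent-encoded the form data. A Python sketch of the decoding side,
assuming the search form lives on an EUC-JP page so the query arrives
in EUC-JP; the query string and the parameter names are only there for
illustration:

```python
from urllib.parse import parse_qs, quote


def decode_query(query_string, charset):
    """Decode a percent-encoded query string using the given charset."""
    return parse_qs(query_string, encoding=charset, errors="replace")


# Simulate what a browser would send from an EUC-JP form (assumed setup).
qs = "words=" + quote("検索".encode("euc_jp")) + "&method=and"

params = decode_query(qs, "euc_jp")
wrong = decode_query(qs, "latin-1")

print(params["words"][0])           # decoded with the right charset
print(repr(wrong["words"][0]))      # same bytes read as Latin-1
```

If the search side treats the query as opaque bytes in the same
encoding as the index, no decoding is needed at all; it only matters
once the two encodings differ.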

Oskar
# armani would not need to wait for (1) since they know the encoding.
# if pages are served in EUC, I believe you can skip (2).
