I18N of Nutch (was ..Re: [Nutch-dev] Patch to access ParseData of a search hit)

Jungshik Shin Tue, 13 Jul 2004 06:04:09 -0700

Andrzej Bialecki wrote:

Stefan Groschupf wrote:
My patch simply allows you to retrieve also the metadata about the page - e.g. language code, outlinks etc, and _most_ importantly the character encoding, so that you can encode the output properly.

Thanks for this change. This is one of the places where Nutch lags behind Google, and I'm glad to see that it's finally addressed, although I'm at the same time a bit disappointed that what I planned to contribute to the project has been taken care of by you :-). BTW, sometimes it's not easy to figure out the character encoding of an HTML document (because it is declared neither in the HTTP header nor in a meta tag). I believe Google uses a commercial character encoding detector (most likely the one made by Basis Technology) to cope with this problem. Mozilla has two character encoding detectors, but they're not as good as Basis' (Basis claims a detection rate of over 98%):

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
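
To make the "declared" case concrete, here is a minimal Java sketch of the two cheap checks that come before any statistical detector: the charset parameter of the HTTP Content-Type header, then a meta tag near the top of the document. The class and method names (DeclaredCharset, detect, and so on) are hypothetical and not part of Nutch, and the regular expressions are a rough approximation rather than a full HTML parser. When both checks come back empty, a detector such as the ones mentioned above would have to take over.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds a *declared* charset only.
class DeclaredCharset {
    // Matches "charset=NAME", with optional quotes, in a header value or a tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag so the charset search stays inside it.
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the charset label found in the given text, upper-cased.
    static Optional<String> find(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(text);
        return m.find()
                ? Optional.of(m.group(1).toUpperCase(Locale.ROOT))
                : Optional.empty();
    }

    // Scans meta tags in the start of the document. The caller is assumed to
    // have decoded the first bytes as ISO-8859-1, which is safe for ASCII tags.
    static Optional<String> fromMeta(String htmlPrefix) {
        if (htmlPrefix == null) {
            return Optional.empty();
        }
        Matcher tags = META_TAG.matcher(htmlPrefix);
        while (tags.find()) {
            Optional<String> found = find(tags.group());
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }

    // The HTTP header wins over the meta tag; empty means "not declared".
    static Optional<String> detect(String contentTypeHeader, String htmlPrefix) {
        Optional<String> fromHeader = find(contentTypeHeader);
        return fromHeader.isPresent() ? fromHeader : fromMeta(htmlPrefix);
    }
}
```

An empty result from detect(...) is exactly the hard case described above, where neither the header nor the document says anything about its encoding.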

As for Google, I kept asking them to make use of the 'language code' when displaying their search results, but they would not. Now that you have made the lang code available, Nutch can specify the language with the HTML 'lang' attribute. See the following pages for 'lang':

http://www.w3.org/International/questions/qa-lang-why
http://www.w3.org/International/tutorials/tutorial-lang/

http://www.w3.org/International/resource-index.html
has links to a lot of other tutorials and faqs on I18N.
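
As a rough illustration of the 'lang' idea, the following Java sketch wraps a hit's text in a span that carries the language code when one is known. The class LangMarkup and its methods are hypothetical names, not Nutch code, and the language-tag check is a deliberately loose pattern rather than a full validation of language tags.

```java
// Hypothetical helper, not part of Nutch: marks up one search hit.
class LangMarkup {
    // Escapes the characters that are unsafe in HTML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Loose shape check for tags such as "ko", "ru" or "zh-TW" (an assumption,
    // not a complete language-tag grammar).
    static boolean looksLikeLangTag(String code) {
        return code != null && code.matches("[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*");
    }

    // Emits <span lang="..."> when the code is usable, a plain <span> otherwise.
    static String hitSnippet(String langCode, String text) {
        String body = escape(text);
        if (!looksLikeLangTag(langCode)) {
            return "<span>" + body + "</span>";
        }
        return "<span lang=\"" + langCode + "\">" + body + "</span>";
    }
}
```

With the attribute in place, a browser can choose fonts and line-breaking rules per hit, which matters when one result page mixes, say, Korean and Russian snippets.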

Currently there is no way to figure out the original character encoding, because you only get byte[] from getContent(HitDetails). So, if you have indexed e.g. Russian pages encoded in KOI8-R, you won't get that information when trying to display a page - which involves doing "new String(content)", which assumes the default platform encoding - most probably Latin1. You have broken the content already at this point. But the next step is to send it to the browser, and the Servlet container also needs to know what character encoding to apply when you output a Unicode String. Most often it applies Latin1, but you can specifically request e.g. UTF-8. Now the output is broken beyond recognition.

As you would agree, Nutch should display its search results in UTF-8 (by default) and should explicitly declare that encoding, so that it can represent many different scripts in a single page.
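
The two steps discussed above (decode the stored bytes with the page's own encoding, then emit UTF-8 and say so) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class Transcode and its methods are hypothetical, the charset name is assumed to come from the page metadata that the patch exposes, and falling back to UTF-8 for an unknown name is an arbitrary choice made for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
class Transcode {
    // Resolves a charset name; falls back to UTF-8 (an assumption of this
    // sketch) when the name is missing, malformed or unsupported.
    static Charset resolve(String name) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name: use the fallback below.
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes with the page's encoding instead of new String(content),
    // which would silently use the platform default.
    static String decode(byte[] content, String charsetName) {
        return new String(content, resolve(charsetName));
    }

    // Encodes the Unicode string as the bytes that go to the browser.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Value for the Content-Type response header, so that the declared
    // encoding matches the bytes produced by toUtf8.
    static String contentTypeHeader() {
        return "text/html; charset=UTF-8";
    }
}
```

In a servlet or JSP the last two steps would normally be handled by setting the response content type before writing, so that the container does the encoding itself; the sketch spells them out only to show that the declared charset and the actual bytes must agree.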

Jungshik
