Stefan Groschupf wrote:
> Hi Andrzej,
> is it possible to show cached content with your patch?
Yes, that part is not changed at all. It was possible before, it is possible now.
However, I suspect that by "content" you mean a complete Web page, including sub-frames, scripts and images (linked resources), but that's not the case with the current architecture. Currently the "content" is a single HTML page alone, without any linked resources. Some of these resources may end up in the index after all, but they are not returned by getContent().
> I'm confused. I think Nutch does not download the images of a web page, does it?
No, it doesn't. When you present the cached copy of an HTML page in a browser, it contains links to the original images. Your browser gets them from the original site, if they are still present there...
My patch simply allows you to also retrieve the metadata about the page, e.g. language code, outlinks etc., and _most_ importantly the character encoding, so that you can encode the output properly.
Currently there is no way to figure out the original character encoding, because you only get byte[] from getContent(HitDetails). So, if you have indexed e.g. Russian pages encoded in KOI8-R, you won't get that information when trying to display a page. Displaying it involves doing "new String(content)", which assumes the default platform encoding, most probably Latin1. At this point you have already broken the content. The next step is to send it to the browser, and the servlet container also needs to know what character encoding to apply when you output a Unicode String. Most often it applies Latin1, although you can specifically request e.g. UTF-8. By now the output is broken beyond recognition.
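To make the first failure concrete, here is a minimal sketch using only the standard JDK, not Nutch code. The class name DecodeDemo and the sample string are made up for illustration, and it assumes the JDK ships the KOI8-R charset (regular JDK builds do). It decodes the same bytes once as Latin1 and once as KOI8-R:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class DecodeDemo {
    // Decode raw page bytes with an explicitly chosen charset.
    static String decode(byte[] content, Charset charset) {
        return new String(content, charset);
    }

    public static void main(String[] args) {
        Charset koi8r = Charset.forName("KOI8-R");

        // A Russian word written with Unicode escapes, so the source file stays ASCII.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // This stands in for the byte[] that getContent() hands back.
        byte[] content = original.getBytes(koi8r);

        // Decoding as Latin1 maps every byte to an accented Latin character.
        String wrong = decode(content, StandardCharsets.ISO_8859_1);
        // Decoding with the charset the page was written in restores the text.
        String right = decode(content, koi8r);

        System.out.println(original.equals(wrong)); // false
        System.out.println(original.equals(right)); // true
    }
}
```

Note that new String(content) without a charset argument behaves like the "wrong" case whenever the platform default differs from the page's encoding.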
This problem has not been obvious to users so far, because most of the pages are in English or other Western European languages, which fit into the default Latin1 encoding.
So, using this patch you can correctly specify another encoding as necessary when converting byte[] to String, and when sending it to the browser.
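Roughly, both steps could look like the sketch below. This is an assumption-laden illustration, not the patch itself: CachedPageWriter, resolve() and write() are hypothetical names, the charset name is assumed to arrive as a plain String taken from the page metadata, and Latin1 is used as the fallback only to mirror the current behaviour.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
class CachedPageWriter {
    // Turn a charset name from the metadata into a Charset,
    // using the fallback when the name is missing, malformed or unsupported.
    static Charset resolve(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return fallback;
        }
    }

    // Step 1: decode the bytes with the page's own charset.
    // Step 2: encode the resulting String as UTF-8 on the way out.
    static void write(byte[] content, String charsetName, OutputStream out) throws IOException {
        Charset source = resolve(charsetName, StandardCharsets.ISO_8859_1);
        String page = new String(content, source);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(page);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] content = original.getBytes(Charset.forName("KOI8-R"));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(content, "KOI8-R", out);

        String sent = new String(out.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(original.equals(sent)); // true
    }
}
```

In a servlet the OutputStream would be the response stream, and the response would also have to declare the output encoding, for example with response.setContentType("text/html; charset=UTF-8"), so that the browser decodes it the same way.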
-- Best regards, Andrzej Bialecki
------------------------------------------------- Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator ------------------------------------------------- FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
