Hi Zsolt,

>Here is the cache view :
>http://64.34.163.57:8080/nutch-0.9/cached.jsp?idx=0&id=0

When I hit this with curl, I see that it's 
returning Content-Type: 
text/html;charset=iso-8859-2 in the response 
header, and the content has <meta 
http-equiv="Content-Type" content="text/html; 
charset=iso-8859-2">.

But I see that the base href is:

<base href="http://www.daganatok.hu/">

And when I hit that URL, I get back:

< Content-Type: text/html; charset=utf-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
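
To make the mismatch between the two responses concrete, here is a minimal Python sketch that pulls the charset out of a Content-Type value, whether it comes from the HTTP header or from a meta tag's content attribute. The helper name charset_from_content_type is made up for illustration and is not part of Nutch; the two sample values are the ones reported above.

```python
import re


def charset_from_content_type(value):
    """Return the lowercase charset named in a Content-Type value, or None."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None


# Values reported above: the Nutch cached view vs. the original site.
cached_view = "text/html;charset=iso-8859-2"
original_page = "text/html; charset=utf-8"

print(charset_from_content_type(cached_view))    # iso-8859-2
print(charset_from_content_type(original_page))  # utf-8
```

If the two values differ for the same document, as they do here, some component between fetch and display is re-labelling or re-encoding the content.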

The data seems to be valid UTF-8, and from my 
experience Nutch works correctly with correctly 
identified UTF-8 web pages.

So I'm guessing the '?' come about when your 
webapp container/server tries to convert the 
UTF-8 data to 8859-2.
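
As a rough illustration of how a conversion step can produce '?', here is a small Python sketch. It is only a model of the guess above: which charset the container actually assumes for the stored bytes is not known from this thread, so the "ascii" case is an assumption chosen to show the failure mode, and the function name convert is hypothetical.

```python
def convert(data, assumed_charset, target_charset):
    """Decode bytes under an assumed charset, then re-encode to the target.

    Bytes or characters that cannot be mapped are replaced rather than
    raising an error, which is how a stray '?' can end up in the output.
    """
    text = data.decode(assumed_charset, errors="replace")
    return text.encode(target_charset, errors="replace")


# The sample characters from the original report (a-acute, e-acute,
# u-acute, o-double-acute), stored as UTF-8 bytes.
utf8_bytes = "\u00e1\u00e9\u00fa\u0151".encode("utf-8")

# Correctly identified as UTF-8: every character maps into ISO-8859-2.
print(convert(utf8_bytes, "utf-8", "iso-8859-2"))

# Wrongly assumed to be plain ASCII: the high bytes are unmappable,
# so the output contains only '?' characters.
print(convert(utf8_bytes, "ascii", "iso-8859-2"))
```

The point is only that the input must be decoded as UTF-8 before any conversion to 8859-2; if that identification is lost somewhere along the way, question marks are the typical symptom.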

-- Ken

>Ken Krugler ([EMAIL PROTECTED]) wrote:
>>
>>  >Hi All,
>>  >
>>  >I would like to share an issue regarding the encoding
>>  >using Nutch 0.9.x.
>>  >
>>  >When I'm indexing some sites, which contains lot of
>>  >ISO-8859-2 characters, (these are mainly eastern-european
>>  >sites, mainly hungarian ones) then at the search page
>>  >I cannot see the characters correctly. Even at the cached
>>  >view, the non-english characters like áéúő are visible
>>  >as a question mark.
>>  >
>>  >If some of you, have an experience with this issue,
>>  >I would be glad when some of You can help me.
>>
>>  What's the URL of an example page with this type of problem?
>>
>  > -- Ken

-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general