>Does anybody know how to set a character encoding
>other than UTF-8, which seems to be the default in
>Nutch 0.8.1 on Tomcat 5? (Ubuntu 6.10 / Tomcat 5.0)
>
>What I have tried :
>
>In <tomcat_root>/conf/web.xml :
>(in jsp section) :
>Added :
><init-param>
><param-name>javaEncoding</param-name>
><param-value>ISO-8859-1</param-value>
></init-param>
>
>In <tomcat_root>/webapps/ROOT/WEB-INF/web.xml :
>(in <servlet-name>Cached</servlet-name> section)
>Added :
><init-param>
><param-name>javaEncoding</param-name>
><param-value>ISO-8859-1</param-value>
></init-param>
>
>Stopped and restarted Tomcat (from the crawldir folder
>of Nutch)
>
>The browser keeps showing UTF-8 encoded pages, and
>French special characters are being replaced with
>wrong characters.

I'm not a .jsp jock, but I believe the UTF-8 encoding is hard-coded in 
the pages themselves. See this search 
(http://krugle.com/kse/files?query=utf-8&lang=jsp&project=nutch), 
which turns up a bunch of .jsp pages in Nutch that declare the UTF-8 
encoding in their HTML sections.

But leaving that aside, in general UTF-8 is the safest encoding to 
use. If a browser is showing "wrong characters", and the browser is 
relatively new, then my guess would be that the encoding was 
misidentified when the data was initially parsed, so the text wound 
up in the Nutch segments/index with the wrong characters.
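
To illustrate the kind of parse-time mix-up I mean, here is a minimal 
Java sketch. It is not Nutch code; the class and method names are 
made up for the example. It just shows what happens to a French 
character when bytes written in one encoding are decoded as the other.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch or Tomcat.
public class MojibakeDemo {

    // Encode the text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    // Each multi-byte UTF-8 sequence turns into two or more Latin-1 characters.
    public static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode the text as ISO-8859-1, then (wrongly) decode the bytes as UTF-8.
    // Bytes that are not valid UTF-8 are replaced with U+FFFD.
    public static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, written as an escape so the source
        // file's own encoding does not matter.
        String original = "caf\u00e9";

        // The e-acute becomes the two characters U+00C3 and U+00A9.
        System.out.println(utf8ReadAsLatin1(original));

        // The lone 0xE9 byte is not valid UTF-8 and is replaced.
        System.out.println(latin1ReadAsUtf8(original));
    }
}
```

If the damage happened at parse time like this, changing the encoding 
of the search pages won't repair it; the stored text is already wrong 
and the content would most likely have to be fetched and parsed again.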

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
