>Does anybody know how to set another character
>encoding than UTF-8, which seems to be the default in
>Nutch 0.8.1 on Tomcat 5 ? (Ubuntu 6.10 / Tomcat 5.0)
>
>What I have tried :
>
>In <tomcat_root>/conf/web.xml :
>(in jsp section) :
>Added :
><init-param>
><param-name>javaEncoding</param-name>
><param-value>ISO-8859-1</param-value>
></init-param>
>
>In <tomcat_root>/webapps/ROOT/WEB-INF/web.xml :
>(in <servlet-name>Cached</servlet-name> section)
>Added :
><init-param>
><param-name>javaEncoding</param-name>
><param-value>ISO-8859-1</param-value>
></init-param>
>
>Stopped and restarted Tomcat (from the crawldir folder
>of Nutch)
>
>The browser keeps showing UTF-8 encoded pages, and
>french special characters are being replaced with
>wrong characters.

I'm not a .jsp jock, but I believe the UTF-8 encoding is baked into the pages. See this search (http://krugle.com/kse/files?query=utf-8&lang=jsp&project=nutch), which turns up a number of Nutch .jsp pages that declare UTF-8 in their HTML sections.

Leaving that aside, UTF-8 is in general the safest encoding to use. If a relatively recent browser is showing "wrong characters", my guess is that there was an encoding problem when the data was initially parsed, so the text wound up in the Nutch segments/index with the wrong values.

-- Ken

-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
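
To make the "encoding problem at parse time" guess above concrete, here is a minimal Java sketch of what happens when text written in one charset is decoded as another. It does not use any Nutch code; the class name EncodingMismatchDemo and the helper transcode are made up for illustration, and the only assumption is that a fetched page's bytes were ISO-8859-1 while the parser treated them as UTF-8 (or the reverse).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Hypothetical helper: encode text with one charset, then decode the
    // resulting bytes with a (possibly different) charset.
    public static String transcode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    // Print code points in hex so the result does not depend on the
    // console's own encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // the French letter e with acute accent

        // ISO-8859-1 bytes misread as UTF-8: the lone byte 0xE9 is not valid
        // UTF-8, so the decoder substitutes the replacement character.
        String latin1ReadAsUtf8 = transcode(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // UTF-8 bytes misread as ISO-8859-1: the two bytes 0xC3 0xA9 each
        // become a separate Latin-1 character.
        String utf8ReadAsLatin1 = transcode(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("original:             " + codePoints(original));
        System.out.println("Latin-1 read as UTF-8: " + codePoints(latin1ReadAsUtf8));
        System.out.println("UTF-8 read as Latin-1: " + codePoints(utf8ReadAsLatin1));
    }
}
```

In the first case the original character is gone by the time the text is stored, so changing the charset the web app serves later cannot bring it back, which is why the parse step is worth checking first.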
