Any suggestions? I finally modified org.apache.nutch.parse.html.HtmlParser to remove the offending sequence from the input stream before passing it to the NekoHTML or TagSoup parsers (both have this issue).
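
For anyone who wants to try the same kind of workaround, here is a rough sketch of such a pre-filter. It is not the actual change made to HtmlParser. It assumes the raw page bytes are available as a byte[] before they are wrapped in a stream for NekoHTML or TagSoup, and the class name SequenceStripper and its replaceAll method are hypothetical names made up for illustration. The byte sequence to strip is left as a parameter.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper: rewrites a byte sequence before the HTML parser sees it. */
public class SequenceStripper {

    /**
     * Returns a copy of input in which every occurrence of unwanted
     * is replaced by replacement (pass an empty array to drop it).
     */
    public static byte[] replaceAll(byte[] input, byte[] unwanted, byte[] replacement) {
        if (unwanted.length == 0) {
            return input.clone(); // nothing to look for
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        int i = 0;
        while (i < input.length) {
            if (matchesAt(input, i, unwanted)) {
                out.write(replacement, 0, replacement.length);
                i += unwanted.length; // skip past the matched sequence
            } else {
                out.write(input[i]);
                i++;
            }
        }
        return out.toByteArray();
    }

    /** True if pattern occurs in input starting exactly at pos. */
    private static boolean matchesAt(byte[] input, int pos, byte[] pattern) {
        if (pos + pattern.length > input.length) {
            return false;
        }
        for (int j = 0; j < pattern.length; j++) {
            if (input[pos + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }
}
```

The filtered array would then be handed to the parser in place of the original content bytes. Matching on raw bytes is safe for a single-byte encoding such as windows-1252; for multi-byte encodings it may be safer to decode to characters first and filter the text instead.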
I also opened a JIRA so that this issue isn't lost: https://issues.apache.org/jira/browse/NUTCH-519

Chris....

Chris Hane wrote:
>
> I have set up nutch 0.9 and things are working correctly except the
> sequence is being converted to:
>
> The character encoding in the html pages is windows-1252.
>
> A sample snippet that is converted is:
>
> ======= start ============
> <td align="center">
>
> <b><font face="Arial" size="0">Address: 120 South
> 7th Street - Terre Haute, IN 47807</font></b>
> </td>
> ======== end ==============
>
> When I look at the parsed text (using bin/nutch readseg...) it looks like:
>
> ======== start ===========
> 120 South 7th Street - Terre Haute, IN 47807
> ======== end =============
>
> Is there a way to get the sequence to either be ignored or translated
> correctly to the space character?
>
> Thanks,
> Chris....

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
