Re: [Nutch-general] nbsp converted to funky character

Chris Hane Wed, 18 Jul 2007 15:14:10 -0700

Any suggestions?  As a workaround, I finally modified 
org.apache.nutch.parse.html.HtmlParser to strip the &nbsp; entities from the 
input stream before passing it to the NekoHTML or TagSoup parser (both have 
this issue).
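
For anyone who wants to try the same thing, here is a minimal sketch of that kind of pre-processing step. It is not the actual change to HtmlParser: the class and method names (NbspStripper, replaceNbspEntities) are made up for illustration, and it replaces each &nbsp; with a plain space rather than removing it, so that adjacent words do not run together. It assumes the page bytes are in an ASCII-compatible encoding such as windows-1252.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: cleans the raw page bytes
// before they are handed to the HTML parser.
class NbspStripper {

    // Replaces the literal entity "&nbsp;" with an ordinary space.
    // ISO-8859-1 maps every byte to exactly one char and back again,
    // so all other bytes (including windows-1252 specials) pass through unchanged.
    static byte[] replaceNbspEntities(byte[] content) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        String cleaned = text.replace("&nbsp;", " ");
        return cleaned.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = "120 South 7th Street&nbsp; -&nbsp; Terre Haute, IN 47807"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] cleaned = replaceNbspEntities(raw);
        System.out.println(new String(cleaned, StandardCharsets.ISO_8859_1));
    }
}
```

This only catches the named entity; a numeric reference such as &#160; or a raw 0xA0 byte would need separate handling.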


I also opened a JIRA so that this issue isn't lost:
https://issues.apache.org/jira/browse/NUTCH-519

Chris....

Chris Hane wrote:
> 
> I have set up Nutch 0.9 and things are working correctly, except that the 
> &nbsp; sequence is being converted to: Â
> 
> The character encoding of the HTML pages is windows-1252.
> 
> A sample snippet that is converted is:
> 
> ======= start ============
> <td align="center">
> 
>       <b><font face="Arial" size="0">Address: 120 South
> 7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
>       </td>
> ======== end ==============
> 
> When I look at the parsed text (using bin/nutch readseg...) it looks like:
> 
> ======== start ===========
> 120 South 7th StreetÂ  -Â  Terre Haute, IN 47807
> ======== end =============
> 
> Is there a way to get the &nbsp; to either be ignored or translated 
> correctly to the space character?
> 
> Thanks,
> Chris....
> 
