One more thing: I'm running Linux 2.6.21 with en_US ISO-8859-1 as the OS 
charset.  Could that make a difference?

Thanks for any help.

Chris....

Chris Hane wrote:
> 
> I have set up Nutch 0.9 and things are working correctly, except that the 
> &nbsp; sequence is being converted to: Â
> 
> The character encoding in the html pages is windows-1252.
> 
> A sample snippet that is converted is:
> 
> ======= start ============
> <td align="center">
> 
>       <b><font face="Arial" size="0">Address: 120 South
> 7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
>       </td>
> ======== end ==============
> 
> When I look at the parsed text (using bin/nutch readseg...) it looks like:
> 
> ======== start ===========
> 120 South 7th Street  -  Terre Haute, IN 47807
> ======== end =============
> 
> Is there a way to get the &nbsp; to either be ignored or translated 
> correctly to the space character?
> 
> Thanks,
> Chris....
> 
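
A stray Â in front of a space is often what a non-breaking space (U+00A0) looks like when its UTF-8 bytes (0xC2 0xA0) are read back as ISO-8859-1. Whether that is what happens here, for example when readseg output is viewed under an ISO-8859-1 locale, is only an assumption. The sketch below reproduces the effect in plain Java and shows one possible workaround, replacing U+00A0 with an ordinary space. The class and method names are made up for illustration and are not part of Nutch.

```java
import java.nio.charset.StandardCharsets;

public class NbspDemo {

    // Encode the text as UTF-8, then decode the same bytes as ISO-8859-1.
    // This mimics a charset mismatch between the writer and the reader.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical cleanup step: turn each non-breaking space (U+00A0)
    // into a plain ASCII space before the text is stored or displayed.
    static String normalizeNbsp(String s) {
        return s.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Parsed text in which each &nbsp; entity has become U+00A0.
        String parsed = "7th Street\u00A0-\u00A0Terre Haute";

        // After the mismatch, every U+00A0 is preceded by U+00C2 (shown as Â).
        String garbled = misdecode(parsed);
        System.out.println("contains U+00C2: " + (garbled.indexOf('\u00C2') >= 0));

        // After normalization only plain spaces remain.
        System.out.println(normalizeNbsp(parsed));
    }
}
```

If the mismatch is only in how the output is displayed, viewing it as UTF-8 may be enough. The replacement step is for the case where plain spaces are wanted in the parsed text itself.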

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
