I have set up Nutch 0.9 and things are working correctly, except that the
&nbsp; sequence is being converted to: Â

The character encoding in the html pages is windows-1252.

A sample snippet that is converted is:

======= start ============
<td align="center">

       <b><font face="Arial" size="0">Address: 120 South
7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
       </td>
======== end ==============

When I look at the parsed text (using bin/nutch readseg...) it looks like:

======== start ===========
120 South 7th Street  -  Terre Haute, IN 47807
======== end =============

Is there a way to get the &nbsp; to either be ignored or translated 
correctly to the space character?
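
A minimal sketch of what may be going on, assuming the &nbsp; entity is parsed to the no-break space character (U+00A0), written out as UTF-8, and then displayed by something that reads the bytes as a single-byte encoding. The class and method names (NbspDemo, misdecode, normalizeNbsp) are made up for illustration and are not part of Nutch. ISO-8859-1 stands in for windows-1252 here because the two agree on the bytes involved (0xC2 and 0xA0).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of Nutch.
class NbspDemo {

    // Reproduces the suspected symptom: encode the text as UTF-8, then
    // decode those bytes as a single-byte charset. U+00A0 becomes the two
    // bytes 0xC2 0xA0, which decode to U+00C2 (A with circumflex) followed
    // by U+00A0.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // One possible cleanup: replace every no-break space with a plain space
    // before the text is stored or displayed.
    static String normalizeNbsp(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // Assumed form of the parsed text, with U+00A0 where &nbsp; was.
        String parsed = "7th Street\u00A0-\u00A0Terre Haute";
        System.out.println(misdecode(parsed));      // shows the stray character
        System.out.println(normalizeNbsp(parsed));  // plain spaces only
    }
}
```

If this is what is happening, the parsed text itself may be fine and only the charset used to view the readseg output is wrong; the normalizeNbsp step would matter only if plain spaces are actually wanted in the index.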

Thanks,
Chris....

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general