On Jun 18, 2007, at 7:36 PM, Micah Vivion wrote:
>
> I am trying to change how Nutch indexes web page content. I would
> like the content of intranet web pages to be stored in the index.
> Based on what I found in the mailing list archives, the recommended
> way to achieve this is to modify BasicIndexingFilter.java at line 72:
>
> change
> doc.add(new Field("content", parse.getText(), Field.Store.NO,  
> Field.Index.TOKENIZED));
>
> to
>
> doc.add(new Field("content", parse.getText(), Field.Store.YES,  
> Field.Index.TOKENIZED));
>
> After making this change and rebuilding Nutch, the content field
> is still not stored in the index.


Just to be clear, are you re-crawling after making this change? You
need to delete the existing index and re-crawl before the change
takes effect.

If you are, make sure bin/nutch is picking up the right .jar. The
simplest way to test this is to log or print a debug string right
before the doc.add() line you edited.
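
A minimal sketch of such a debug check, using only the standard Java
library. The class name JarLocator and the method describe() are
made up for illustration and are not part of Nutch. The idea is to
print where a class was actually loaded from, so the log shows which
build is running.

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical helper, not part of Nutch: reports where a class was loaded from.
final class JarLocator {

    /** Returns a one-line description of the code source of the given class. */
    static String describe(Class<?> cls) {
        ProtectionDomain domain = cls.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        // Classes from the JDK itself usually have no code source.
        String where = (source == null || source.getLocation() == null)
                ? "(bootstrap or unknown source)"
                : source.getLocation().toString();
        return cls.getName() + " loaded from " + where;
    }

    public static void main(String[] args) {
        // Inside the filter you would pass the filter's own class instead.
        System.err.println("DEBUG: " + describe(JarLocator.class));
    }
}
```

In the filter itself, a line such as
System.err.println(JarLocator.describe(BasicIndexingFilter.class));
placed just before the doc.add() call would serve as the debug string.
If the message never appears, or the location is not the .jar you just
rebuilt, bin/nutch is still running the old code.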




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
