Greetings,

I am trying to change how Nutch indexes web page content. I would like the content of intranet web pages to be stored in the index. Based on information I found in the mailing list archives, the recommended way to achieve this is to modify BasicIndexingFilter.java on line 72:

change
doc.add(new Field("content", parse.getText(), Field.Store.NO, Field.Index.TOKENIZED));

to

doc.add(new Field("content", parse.getText(), Field.Store.YES, Field.Index.TOKENIZED));

After making this change and rebuilding Nutch, the content field is still not stored in the index. (Yes, I realize that a field does not need to be stored to be searchable, but for other reasons I would like to store this one.) What am I doing wrong? In fact, no matter what values I put into BasicIndexingFilter, I cannot get them to take effect. I am verifying this by looking at the index with Luke 0.7.

To index my local intranet site (actually, at this point, just a sample Apache install), I am running:
bin/nutch crawl urls -dir ~/lucene/localhost -depth 3

In my nutch-default.xml file I have tried changing the indexingfilter.order value from blank to
<name>indexingfilter.order</name>
<value>org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter</value>

I have also made sure that plugin.includes will pick up the index-basic plugin:

<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|index-more|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
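
For what it is worth, here is a rough sketch of how a plugin.includes value could be sanity-checked outside of Nutch. It assumes that Nutch compiles the value as a single regular expression and requires it to match the whole plugin id. The class name PluginIncludesCheck and the method isIncluded are made up for this illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks plugin ids against a
// plugin.includes style expression.
public class PluginIncludesCheck {

    // Assumption: the value is one regular expression that must match
    // the entire plugin id, not just a part of it.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "protocol-http|urlfilter-regex|parse-(text|html|js)"
                + "|index-basic|index-more|query-(basic|site|url|more)"
                + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

        String[] ids = {"index-basic", "index-more", "scoring-opic"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }

        // A stray space inside the value turns the alternative into
        // "index- basic", which no longer matches the id "index-basic".
        String withSpace = includes.replace("index-basic", "index- basic");
        System.out.println("index-basic included (value with stray space): "
                + isIncluded(withSpace, "index-basic"));
    }
}
```

If the first three checks report true, the expression itself is probably not what keeps the index-basic plugin from being applied.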

Some other specifics about my configuration:
java version "1.5.0_07"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)
Nutch 0.9
OS X 10.4.9

So what am I missing? Does anyone have any idea why I cannot get the content to be stored?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
