Greetings,
I am trying to change how Nutch indexes web page content. I would like
the content of intranet web pages to be stored in the index. Based on
information that I found reading through the mailing list archives, the
recommended way to achieve this is to modify BasicIndexingFilter.java on
line 72:
change
doc.add(new Field("content", parse.getText(), Field.Store.NO,
Field.Index.TOKENIZED));
to
doc.add(new Field("content", parse.getText(), Field.Store.YES,
Field.Index.TOKENIZED));
After making these changes and rebuilding Nutch, the content field is
still not stored in the index (yes, I realize that you do not need to
store the field to be able to search through it, but for other reasons
I would like to store this field). What am I doing wrong? In fact, no
matter what values I put into BasicIndexingFilter, I cannot get them to
apply. I am verifying this by looking at the index with Luke 0.7.
To index my local intranet site (actually, at this point, just a sample
Apache install), I am running:
bin/nutch crawl urls -dir ~/lucene/localhost -depth 3
In my nutch-default.xml file I have tried changing the
indexingfilter.order value from blank to:
<name>indexingfilter.order</name>
<value>org.apache.nutch.indexer.basic.BasicIndexingFilter
org.apache.nutch.indexer.more.MoreIndexingFilter</value>
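As a quick sanity check on the value above, here is a minimal sketch that splits it on whitespace and prints the class names in order. It assumes the value is treated as a whitespace-separated list, so the line break between the two names would be harmless; that is an assumption, not something confirmed against the Nutch 0.9 source. FilterOrderCheck and parseOrder are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FilterOrderCheck {
    // Hypothetical helper (not part of Nutch): splits an indexingfilter.order
    // value on any run of whitespace, including line breaks.
    static List<String> parseOrder(String value) {
        String trimmed = (value == null) ? "" : value.trim();
        if (trimmed.length() == 0) {
            // A blank value means no explicit order was configured.
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // The value exactly as written in the config, line break included.
        String value = "org.apache.nutch.indexer.basic.BasicIndexingFilter\n"
                + "org.apache.nutch.indexer.more.MoreIndexingFilter";
        for (String name : parseOrder(value)) {
            System.out.println(name);
        }
    }
}
```

Under that assumption the two filter class names come out as separate entries, with BasicIndexingFilter first.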
I have also made sure that plugin.includes will pick up the index-basic
plugin:
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|index-more|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
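One way to double-check the plugin.includes value is to test it as a regular expression against the plugin ids of interest. The sketch below assumes that each plugin id must match the whole expression; that matching rule is an assumption here, and PluginIncludesCheck and isIncluded are hypothetical names, not Nutch classes.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // Hypothetical helper (not part of Nutch): true when the plugin id
    // matches the entire plugin.includes expression.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // The plugin.includes value from above, joined onto a single line.
        String includes = "protocol-http|urlfilter-regex|parse-(text|html|js)"
                + "|index-basic|index-more|query-(basic|site|url|more)"
                + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

        System.out.println(isIncluded(includes, "index-basic")); // true
        System.out.println(isIncluded(includes, "index-more"));  // true
        System.out.println(isIncluded(includes, "parse-pdf"));   // false
    }
}
```

Under the same assumption, a value that literally contains a line break inside "index-basic" (rather than one introduced by email wrapping) would no longer match that plugin id, so it is worth confirming that the value sits on one line in the actual file.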
Some other specifics about my configuration:
java version "1.5.0_07"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)
Nutch 0.9
OS X 10.4.9
So what am I missing? Does anyone have any idea why I cannot get
the content to be stored?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general