Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
-----------------------------------------------------------------------------------

Key: NUTCH-831
URL: https://issues.apache.org/jira/browse/NUTCH-831
Project: Nutch
Issue Type: Improvement
Components: indexer
Reporter: Jeroen van Vianen
Priority: Minor
Fix For: 1.1

Currently, it is impossible to change the way Nutch stores / indexes / tokenizes the fields it creates while crawling and indexing URLs.

I wanted to be able to *store* the content field so I could use my own Lucene code and highlighting code to work on the stored content field. Currently, content is only tokenized.

See nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf) for the current settings.

There's already code in Nutch to configure how fields are stored / indexed / tokenized from conf/nutch-site.xml:

    <property>
      <name>lucene.field.store.content</name>
      <value>YES</value>
    </property>

(content is the name of the field)

However, the BasicIndexer overrides these settings with its own.

Attached is a patch that makes BasicIndexer apply its own settings only when none have been specified in nutch-site.xml.
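
The precedence the patch aims for (a value from nutch-site.xml wins, and the plugin default is used only when the key is absent) can be sketched roughly as follows. This is not the attached patch and does not use the Nutch or Hadoop APIs: a plain Map stands in for the Configuration object, and the class name, the resolve helper, the "NO" defaults and the lucene.field.store.example key are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldOptionDefaults {

    // Hypothetical helper: return the value set in the site configuration,
    // falling back to the plugin default only when the key is not set.
    static String resolve(Map<String, String> siteConf, String key, String pluginDefault) {
        String configured = siteConf.get(key);
        return configured != null ? configured : pluginDefault;
    }

    public static void main(String[] args) {
        // Stand-in for the properties read from conf/nutch-site.xml.
        Map<String, String> siteConf = new HashMap<>();
        siteConf.put("lucene.field.store.content", "YES");

        // Set in the site configuration, so the assumed default "NO" is ignored.
        System.out.println(resolve(siteConf, "lucene.field.store.content", "NO"));

        // Not set in the site configuration (hypothetical key), so the default applies.
        System.out.println(resolve(siteConf, "lucene.field.store.example", "NO"));
    }
}
```

The point of the fallback is that the indexer never writes its own value over one the user has supplied; it only fills in keys that are missing.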