[ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeroen van Vianen updated NUTCH-831:
------------------------------------

    Attachment: LuceneWriter.patch

Here's the patch to LuceneWriter

> Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: LuceneWriter.patch
>
>
> Currently, it is impossible to change the way Nutch stores / indexes /
> tokenizes the fields it creates while crawling and indexing URLs.
> I wanted to be able to *store* the content field so I could use my own Lucene
> code and highlighting code to work on the stored content field. Currently,
> content is only tokenized.
> See
> nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf)
> for the current settings.
> There's already code in Nutch to configure how fields are stored / indexed /
> tokenized from conf/nutch-site.xml:
> <property>
>   <name>lucene.field.store.content</name>
>   <value>YES</value>
> </property>
> (content is the name of the field)
> However, the BasicIndexer overrides these settings with its own. Attached is
> a patch which makes sure the BasicIndexer's own settings are only applied when
> none have been specified in nutch-site.xml.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
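
As a rough illustration of the behaviour described in the issue (a plugin default is used only when nutch-site.xml sets nothing for that field), here is a minimal Java sketch. It is not the attached LuceneWriter.patch. The class FieldOptionDefaults, the method applyDefaultIfUnset, and the plain Map standing in for the Hadoop Configuration object are all hypothetical, and the plugin default of NO for the content field is an assumption made for the example. Only the property name lucene.field.store.content is taken from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: shows "site setting wins,
// plugin default is only a fallback" using a plain Map as the config.
class FieldOptionDefaults {

    /**
     * Writes defaultValue under key only when the configuration has no
     * (non-blank) value for it, and returns the value that is in effect.
     */
    static String applyDefaultIfUnset(Map<String, String> conf,
                                      String key,
                                      String defaultValue) {
        String configured = conf.get(key);
        if (configured == null || configured.trim().isEmpty()) {
            // Nothing specified by the site config: fall back to the default.
            conf.put(key, defaultValue);
            return defaultValue;
        }
        // A site-level value exists: leave it untouched.
        return configured;
    }

    public static void main(String[] args) {
        String key = "lucene.field.store.content";

        // Case 1: the site config explicitly asks for the field to be stored.
        Map<String, String> siteConf = new HashMap<>();
        siteConf.put(key, "YES");
        System.out.println(applyDefaultIfUnset(siteConf, key, "NO"));

        // Case 2: nothing configured, so the assumed plugin default applies.
        Map<String, String> emptyConf = new HashMap<>();
        System.out.println(applyDefaultIfUnset(emptyConf, key, "NO"));
    }
}
```

With this shape, an indexing plugin could call the helper once per field option instead of setting the value unconditionally, which is the kind of override the issue complains about.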