Jeroen van Vianen (JIRA) Wed, 23 Jun 2010 06:18:19 -0700

Allow configuration of how fields crawled by Nutch are stored / indexed / 
tokenized
-----------------------------------------------------------------------------------


                 Key: NUTCH-831
                 URL: https://issues.apache.org/jira/browse/NUTCH-831
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
            Reporter: Jeroen van Vianen
            Priority: Minor
             Fix For: 1.1


Currently, it is impossible to change the way Nutch stores / indexes / 
tokenizes the fields it creates while crawling and indexing URLs.

I wanted to be able to *store* the content field so I could use my own Lucene
code and highlighting code to work on the stored content field. Currently,
content is only tokenized.

See
nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf)
for the current settings.

There's already code in Nutch to configure how fields are stored / indexed / 
tokenized from conf/nutch-site.xml:

<property>
  <name>lucene.field.store.content</name>
  <value>YES</value>
</property>

(content is the name of the field)

However, the BasicIndexer overrides these settings with its own. Attached is a
patch that makes sure BasicIndexer's own settings are applied only when none
have been specified in nutch-site.xml.
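
The "apply the default only when nothing has been configured" behaviour
described above could look roughly like the sketch below. It is not the
attached patch. A plain java.util.Map stands in for the Hadoop Configuration
object, and the class FieldDefaults, the helper setIfUnset, the "title" key and
the default values "NO" and "YES" are hypothetical names and values chosen for
illustration. Only the key lucene.field.store.content comes from this issue.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldDefaults {

    // Hypothetical helper: write the indexer's default only when the site
    // configuration has not already supplied a value for this key.
    static void setIfUnset(Map<String, String> conf, String key, String defaultValue) {
        conf.putIfAbsent(key, defaultValue);
    }

    public static void main(String[] args) {
        // Stand-in for the settings loaded from conf/nutch-site.xml.
        Map<String, String> conf = new HashMap<>();
        conf.put("lucene.field.store.content", "YES");

        // Assumed indexer defaults; the real values live in BasicIndexer.
        setIfUnset(conf, "lucene.field.store.content", "NO");
        setIfUnset(conf, "lucene.field.store.title", "YES");

        // The user's value for content is kept; title falls back to the default.
        System.out.println("content: " + conf.get("lucene.field.store.content"));
        System.out.println("title:   " + conf.get("lucene.field.store.title"));
    }
}
```

With this shape, a value set in nutch-site.xml always wins, and the indexer's
built-in choice is used only as a fallback.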

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
