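[jira] Updated: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized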

Jeroen van Vianen (JIRA) Wed, 23 Jun 2010 06:21:23 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jeroen van Vianen updated NUTCH-831:
------------------------------------

    Attachment: LuceneWriter.patch

Here's the patch to LuceneWriter.

> Allow configuration of how fields crawled by Nutch are stored / indexed / 
> tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: LuceneWriter.patch
>
>
> Currently, it is impossible to change the way Nutch stores / indexes / 
> tokenizes the fields it creates while crawling and indexing URLs.
> I wanted to be able to *store* the content field so I could use my own Lucene 
> code and highlighting code to work on the stored content field. Currently, 
> content is only tokenized.
> See 
> nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf)
> for the current settings.
> There's already code in Nutch to configure how fields are stored / indexed / 
> tokenized from conf/nutch-site.xml:
> <property>
>   <name>lucene.field.store.content</name>
>   <value>YES</value>
> </property>
> (content is the name of the field)
> However, the BasicIndexer overrides these settings with its own. Attached is 
> a patch which makes sure the BasicIndexer's own settings are applied only 
> when none have been specified in nutch-site.xml.
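
The behaviour described above (a plugin default applies only when the user has not configured the field) can be sketched roughly as follows. This is not the attached patch and does not use Nutch or Lucene classes: the plain Map stands in for the parsed configuration, the applyDefault helper is hypothetical, and the default values shown are assumptions chosen for illustration. Only the "lucene.field.store." key prefix comes from the nutch-site.xml example in the issue.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldStoreDefaults {
    // Key prefix taken from the nutch-site.xml example in the issue.
    public static final String STORE_PREFIX = "lucene.field.store.";

    // Hypothetical helper: set a plugin default only if the user configured nothing
    // for this field, so an explicit user setting is never overwritten.
    public static void applyDefault(Map<String, String> conf, String field, String pluginDefault) {
        conf.putIfAbsent(STORE_PREFIX + field, pluginDefault);
    }

    public static void main(String[] args) {
        // Stand-in for the loaded configuration (not Nutch's Configuration class).
        Map<String, String> conf = new HashMap<>();
        conf.put(STORE_PREFIX + "content", "YES"); // user setting from nutch-site.xml

        applyDefault(conf, "content", "NO"); // assumed plugin default; the user value is kept
        applyDefault(conf, "title", "YES");  // assumed plugin default; nothing configured, so it applies

        System.out.println("content -> " + conf.get(STORE_PREFIX + "content"));
        System.out.println("title   -> " + conf.get(STORE_PREFIX + "title"));
    }
}
```

The point of the sketch is the ordering: the user's value is consulted first, and the built-in default only fills the gap.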

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
