
Chris A. Mattmann (JIRA) Mon, 28 Jun 2010 22:40:51 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883415#action_12883415 ]


Chris A. Mattmann commented on NUTCH-831:
-----------------------------------------

I applied this patch to the Nutch 1.2 branch and all tests passed:

test:
     [echo] Testing plugin: urlnormalizer-regex
    [junit] Running org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.28 sec
    [junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.209 sec

test:

BUILD SUCCESSFUL
Total time: 10 minutes 50 seconds
[chipotle:~/tmp/nutch-1.2] mattmann% 

I'll commit the patch there so you can have it in SVN and use it, but I'll set 
the fix version to nil since the movement is towards Solr in the trunk. Thanks 
for the contribution, regardless, Jeroen!

Cheers,
Chris


> Allow configuration of how fields crawled by Nutch are stored / indexed / 
> tokenized
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-831
>                 URL: https://issues.apache.org/jira/browse/NUTCH-831
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jeroen van Vianen
>            Priority: Minor
>             Fix For: 2.0
>
>         Attachments: LuceneWriter.patch
>
>
> Currently, it is impossible to change the way Nutch stores / indexes / 
> tokenizes the fields it creates while crawling and indexing URLs.
> I wanted to be able to *store* the content field so I could use my own Lucene 
> code and highlighting code to work on the stored content field. Currently, 
> content is only tokenized.
> See 
> nutch-trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexer.addIndexBackendOptions(Configuration conf)
> for the current settings.
> There's already code in Nutch to configure how fields are stored / indexed / 
> tokenized from conf/nutch-site.xml:
> <property>
>   <name>lucene.field.store.content</name>
>   <value>YES</value>
> </property>
> (content is the name of the field)
> However, the BasicIndexer overrides these settings with its own. Attached is 
> a patch which makes sure the BasicIndexer applies its own settings only when 
> none have been specified in nutch-site.xml.
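
The "apply a default only when nothing was configured" behaviour described in the issue can be sketched roughly as follows. This is not the attached LuceneWriter.patch, which is not shown in this message. It is a minimal, self-contained illustration that uses java.util.Properties as a stand-in for the Hadoop Configuration object. The class name, the helper names, the lucene.field.store.title key and the default values are all assumptions made for the example; only lucene.field.store.content comes from the issue text.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's Configuration.
public class FieldStoreDefaults {

    // Set a default only when the user has not configured the key already.
    static void setIfUnset(Properties conf, String key, String defaultValue) {
        if (conf.getProperty(key) == null) {
            conf.setProperty(key, defaultValue);
        }
    }

    // Copy the site configuration, then fill in indexer defaults
    // without overwriting anything the site configuration specified.
    static Properties applyDefaults(Properties siteConf) {
        Properties conf = new Properties();
        conf.putAll(siteConf);
        // Default values below are illustrative assumptions.
        setIfUnset(conf, "lucene.field.store.content", "NO");
        setIfUnset(conf, "lucene.field.store.title", "YES");
        return conf;
    }

    public static void main(String[] args) {
        // Simulates a nutch-site.xml that asks for the content field to be stored.
        Properties site = new Properties();
        site.setProperty("lucene.field.store.content", "YES");

        Properties effective = applyDefaults(site);
        System.out.println("content: " + effective.getProperty("lucene.field.store.content"));
        System.out.println("title:   " + effective.getProperty("lucene.field.store.title"));
    }
}
```

In this sketch the user-supplied value for the content field survives, while the title field, which the site configuration does not mention, falls back to the indexer's default.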

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-831) Allow configuration of how fields crawled by Nutch are stored / indexed / tokenized
