Lewis John McGibbney created NUTCH-2206:
-------------------------------------------

             Summary: Provide example scoring.similarity.stopword.file
                 Key: NUTCH-2206
                 URL: https://issues.apache.org/jira/browse/NUTCH-2206
             Project: Nutch
          Issue Type: Bug
          Components: plugin, scoring
    Affects Versions: 1.11
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.12


The scoring-similarity plugin does not provide an example file for the property 
scoring.similarity.stopword.file.
This is an issue for two reasons:
 * A user does not know what the file is meant to look like, and
 * We always check for this file and will [throw an exception if it is not 
found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80].
 The user may not notice this until much later.

I suggest a simple fix: include the [standard English stop words 
taken from Lucene's 
StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt].
 The comments in that file will help people customize the list to whatever 
they require.
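
For illustration, here is a minimal sketch of how such a file could be read. 
It assumes one stop word per line, with '#' starting a comment line, as in the 
linked stopwords_en.txt. The class name StopwordFileSketch and the default 
file name stopwords.txt are hypothetical. This is not the plugin's actual 
loading code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordFileSketch {

  /**
   * Parses stop word lines. Blank lines and lines starting with '#' are
   * skipped (assumed format: one word per line, '#' for comments).
   */
  static Set<String> parse(List<String> lines) {
    Set<String> stopwords = new LinkedHashSet<>();
    for (String line : lines) {
      String word = line.trim();
      if (word.isEmpty() || word.startsWith("#")) {
        continue; // comment or empty line, nothing to add
      }
      stopwords.add(word.toLowerCase(Locale.ROOT));
    }
    return stopwords;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical default file name; pass the real path as the first argument.
    String path = args.length > 0 ? args[0] : "stopwords.txt";
    List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
    System.out.println(parse(lines).size() + " stop words loaded from " + path);
  }
}
```

With a format like this, a user can customize the list by adding or removing 
single lines and can document each change in a '#' comment.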



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
