Edans Sandes created SOLR-11605:
-----------------------------------

             Summary: ShingleFilter should have an option to skip filler tokens 
(e.g. stop words)
                 Key: SOLR-11605
                 URL: https://issues.apache.org/jira/browse/SOLR-11605
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Schema and Analysis
    Affects Versions: 7.1
            Reporter: Edans Sandes


ShingleFilterFactory should have an option to exclude filler tokens from the 
shingle size. 
For instance (adapted from 
[https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs]),
 consider the text "A brown fox quickly jumps over the lazy dog". After stop 
words are removed and stemming is applied ("jumps" becomes "jump"), 
ShingleFilter with a shingle size of 3 produces the following result:

1. _ brown fox
2. brown fox quickly
3. fox quickly jump
4. quickly jump _
5. jump _ _
6. _ _ lazy
7. _ lazy dog

Each filler token "_" takes up one position in the shingle.
I think the returned shingles should instead be:
1. brown fox quickly
2. fox quickly jump
3. quickly jump lazy
4. jump lazy dog

To maintain backward compatibility, I suggest adding an option called 
"skipFillerTokens" to enable this behavior. Note that this is different from 
using fillerToken="", since the empty string still occupies one position in 
the shingle.
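
The difference between the three cases can be illustrated with a small 
standalone sketch. This is not Lucene's ShingleFilter code and not the attached 
patch; it is a simplified model that assumes each removed stop word shows up as 
a position increment greater than 1 on the next token, and that only shingles 
of exactly the requested size are emitted. The class name ShingleSketch, the 
method shingles() and its skipFillers flag are hypothetical names chosen for 
this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleSketch {
    // Tokens left after stop-word removal and stemming of
    // "A brown fox quickly jumps over the lazy dog".
    static final String[] TERMS = {"brown", "fox", "quickly", "jump", "lazy", "dog"};
    // Position increments: 2 means one removed token precedes the term, 3 means two.
    static final int[] POS_INCS = {2, 1, 1, 1, 3, 1};

    // Simplified model (an assumption, not Lucene's implementation):
    // lay the terms out in slots, optionally inserting one filler per gap,
    // then slide a fixed-size window over the slots.
    static List<String> shingles(String[] terms, int[] posIncs, int size,
                                 String filler, boolean skipFillers) {
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (!skipFillers) {
                // One filler slot for every position skipped before this term.
                for (int gap = 1; gap < posIncs[i]; gap++) {
                    slots.add(filler);
                }
            }
            slots.add(terms[i]);
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= slots.size(); start++) {
            out.add(String.join(" ", slots.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Current behavior: fillers count toward the shingle size.
        System.out.println(shingles(TERMS, POS_INCS, 3, "_", false));
        // Empty filler: the slot is still occupied, only its text is empty.
        System.out.println(shingles(TERMS, POS_INCS, 3, "", false));
        // Proposed behavior: gaps are ignored when building shingles.
        System.out.println(shingles(TERMS, POS_INCS, 3, "_", true));
    }
}
```

In this model the first two calls each return seven shingles, because the 
filler keeps its slot whether or not it has visible text, while the third call 
returns the four shingles proposed above.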

I will attach a patch for the ShingleFilter class (getNextToken() method).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
