[jira] [Commented] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

DOAN DuyHai (JIRA) Sun, 26 Jun 2016 00:54:52 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350028#comment-15350028
 ]


DOAN DuyHai commented on CASSANDRA-12078:
-----------------------------------------

[~xedin]

 I was able to reproduce the unit test failure locally. The failing test is 
{{testTokenizationAdventuresOfHuckFinn}}. With the stop-word filter applied 
before stemming, the expected token count becomes *37739* instead of *40249*.

 Moving the stop-word filter before stemming also triggers a 
{{NullPointerException}}. In some cases the token is removed by the stop-word 
filter, so the input of the stemming filter is null. I've added an extra null 
check in {{DefaultStemmingFilter}}:

{code:java}
        public String process(String input) throws Exception
        {
            if (input == null || stemmer == null)
                return input;
            stemmer.setCurrent(input);
            return (stemmer.stem()) ? stemmer.getCurrent() : input;
        }
{code}
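
 To make the ordering problem and the need for the null check concrete, here is 
a minimal, self-contained sketch. It is not SASI code: the class 
{{PipelineOrderSketch}}, its methods, the two-word stop list and the toy stemmer 
(which only strips a trailing "e") are all hypothetical stand-ins, assumed for 
illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Cassandra.
public class PipelineOrderSketch
{
    // Toy stop-word list (assumption); the real filter uses a per-locale list.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("dans", "la"));

    // Toy stemmer (assumption): strips a single trailing 'e'.
    // Null-safe, mirroring the null check added to the stemming filter.
    public static String stem(String input)
    {
        if (input == null)
            return null;
        return (input.length() > 1 && input.endsWith("e"))
               ? input.substring(0, input.length() - 1)
               : input;
    }

    // Returns null when the term is a stop word, i.e. the token is dropped.
    public static String skipStopWord(String input)
    {
        if (input == null)
            return null;
        return STOP_WORDS.contains(input) ? null : input;
    }

    // Old order: stemming first, then stop-word removal.
    public static String stemThenStop(String term)
    {
        return skipStopWord(stem(term));
    }

    // New order: stop-word removal first, then stemming.
    public static String stopThenStem(String term)
    {
        return stem(skipStopWord(term));
    }

    public static void main(String[] args)
    {
        System.out.println(stemThenStop("danse")); // stemmed to "dans", then dropped
        System.out.println(stopThenStem("danse")); // kept, then stemmed
        System.out.println(stopThenStem("dans"));  // dropped, so the stemmer receives null
    }
}
```

 In this sketch the last call is the case that needs the null guard: once the 
stop-word step runs first, the stemming step can be handed a null token.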

 I have also added a new unit test in {{StandardAnalyzerTest}} to cover the 
French issue described in the issue description:

{code:java}
    @Test
    public void testSkipStopWordBeforeStemmingFrench() throws Exception
    {
        InputStream is = StandardAnalyzerTest.class.getClassLoader()
                .getResourceAsStream("tokenization/french_skip_stop_words_before_stemming.txt");

        StandardTokenizerOptions options = new StandardTokenizerOptions.OptionsBuilder()
                .stemTerms(true)
                .ignoreStopTerms(true).useLocale(Locale.FRENCH)
                .alwaysLowerCaseTerms(true).build();
        StandardAnalyzer tokenizer = new StandardAnalyzer();
        tokenizer.init(options);

        List<ByteBuffer> tokens = new ArrayList<>();
        List<String> words = new ArrayList<>();
        tokenizer.reset(is);
        while (tokenizer.hasNext())
        {
            final ByteBuffer nextToken = tokenizer.next();
            tokens.add(nextToken);
            words.add(UTF8Serializer.instance.deserialize(nextToken.duplicate()));
        }

        assertEquals(4, tokens.size());
        assertEquals("dans", words.get(0));
        assertEquals("plui", words.get(1));
        assertEquals("chanson", words.get(2));
        assertEquals("connu", words.get(3));
    }
{code}

> [SASI] Move skip_stop_words filter BEFORE stemming
> --------------------------------------------------
>
>                 Key: CASSANDRA-12078
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>         Environment: Cassandra 3.7, Cassandra 3.8
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.8
>
>         Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
>     private FilterPipelineTask getFilterPipeline()
>     {
>         FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
>      ...
>         if (options.shouldStemTerms())
>             builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
>         if (options.shouldIgnoreStopTerms())
>             builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
>         return builder.build();
>     }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming, *danse* (*dance* in English) becomes *dans* (the final 
> vowel is removed). The stop-word filter is then applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French, so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE the 
> stemming filter.
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
