Alan Woodward created LUCENE-8717:
-------------------------------------

             Summary: Handle stop words that appear at articulation points
                 Key: LUCENE-8717
                 URL: https://issues.apache.org/jira/browse/LUCENE-8717
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Alan Woodward
            Assignee: Alan Woodward


Our set of TokenFilters currently cannot handle the case where a multi-term 
synonym starts with a stopword.  This means that given a synonym file 
containing the mapping "the walking dead => twd" and a standard English 
stopword filter, QueryBuilder will produce incorrect queries.

The tricky part here is that our standard way of dealing with stopwords, which 
is to just remove them entirely from the token stream and use a larger position 
increment on subsequent tokens, doesn't work when the removed token also has a 
position length greater than 1.  There are various tricks for incrementing 
the position length of the previous token, but these don't work if the 
stopword is the first token in the token stream, or if there are multiple 
stopwords in the side path.
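
To make the failure concrete, here is a rough, self-contained sketch of the 
position arithmetic.  It does not use Lucene classes: Token, Arc and the 
helper methods are hypothetical stand-ins, and the token stream is 
hand-built to resemble what a synonym graph might look like for "the 
walking dead" / "twd" (the exact tokens and order Lucene emits may differ).  
Stopword removal is modelled as "drop the token, add its position increment 
to the next kept token".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopwordGraphSketch {

    /** Toy token: term text, position increment, position length. */
    static final class Token {
        final String term; final int posInc; final int posLen;
        Token(String term, int posInc, int posLen) {
            this.term = term; this.posInc = posInc; this.posLen = posLen;
        }
    }

    /** Arc of the token graph between two position nodes. */
    static final class Arc {
        final String term; final int from; final int to;
        Arc(String term, int from, int to) { this.term = term; this.from = from; this.to = to; }
    }

    /** Hand-built stream (an assumption): "twd" spans the whole side path. */
    static List<Token> example() {
        return Arrays.asList(
            new Token("twd", 1, 3), new Token("the", 0, 1),
            new Token("walking", 1, 1), new Token("dead", 1, 1));
    }

    /** Turn increments and lengths into arcs; positions start at -1. */
    static List<Arc> toArcs(List<Token> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            arcs.add(new Arc(t.term, pos, pos + t.posLen));
        }
        return arcs;
    }

    /** Drop stopwords, carrying their increments over to the next kept token. */
    static List<Token> removeStopwords(List<Token> tokens, Set<String> stopwords) {
        List<Token> kept = new ArrayList<>();
        int skipped = 0;
        for (Token t : tokens) {
            if (stopwords.contains(t.term)) {
                skipped += t.posInc;
            } else {
                kept.add(new Token(t.term, t.posInc + skipped, t.posLen));
                skipped = 0;
            }
        }
        return kept;
    }

    /** Terms whose arc starts at a node that cannot be reached from node 0. */
    static List<String> unreachableTerms(List<Arc> arcs) {
        Set<Integer> reachable = new HashSet<>();
        reachable.add(0);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Arc a : arcs) {
                if (reachable.contains(a.from) && reachable.add(a.to)) changed = true;
            }
        }
        List<String> orphans = new ArrayList<>();
        for (Arc a : arcs) {
            if (!reachable.contains(a.from)) orphans.add(a.term);
        }
        return orphans;
    }

    public static void main(String[] args) {
        List<Token> stream = example();
        System.out.println(unreachableTerms(toArcs(stream)));   // []
        List<Token> filtered = removeStopwords(stream, Collections.singleton("the"));
        System.out.println(unreachableTerms(toArcs(filtered))); // [walking, dead]
    }
}
```

In this toy model the arc for "the" was the only way into the side path, so 
once it is dropped "walking" and "dead" hang off a node that nothing leads 
to, and no adjustment to a previous token can repair that because there is 
no previous token.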

Instead, I'd like to propose adding a new TermDeletedAttribute, which we only 
use on tokens that should be removed from the stream but which hold necessary 
information about the structure of the token graph.  These tokens can then be 
removed by GraphTokenStreamFiniteStrings at query time, and by 
FlattenGraphFilter at index time.
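
As a rough illustration of the idea (not the actual patch), the sketch below 
keeps the stopword's arc in the graph but marks it with a boolean flag, and 
lets the consumer skip flagged terms while walking the paths.  The "deleted" 
field is a hypothetical stand-in for the proposed TermDeletedAttribute, whose 
API is not defined in this issue, and paths() is only a toy analogue of what 
GraphTokenStreamFiniteStrings would do.  How position gaps should be 
reported for the skipped term is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeletedTokenSketch {

    /** Arc of the token graph; 'deleted' is a hypothetical marker flag. */
    static final class Arc {
        final String term; final int from; final int to; final boolean deleted;
        Arc(String term, int from, int to, boolean deleted) {
            this.term = term; this.from = from; this.to = to; this.deleted = deleted;
        }
    }

    /** Hand-built graph (an assumption): "the" stays in place but is flagged. */
    static List<Arc> example() {
        return Arrays.asList(
            new Arc("twd", 0, 3, false), new Arc("the", 0, 1, true),
            new Arc("walking", 1, 2, false), new Arc("dead", 2, 3, false));
    }

    /** Enumerate the term sequences on every path from start to end. */
    static List<List<String>> paths(List<Arc> arcs, int start, int end) {
        List<List<String>> out = new ArrayList<>();
        walk(arcs, start, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Arc> arcs, int node, int end,
                             List<String> prefix, List<List<String>> out) {
        if (node == end) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (Arc a : arcs) {
            if (a.from != node) continue;
            // Flagged arcs are followed for their structure, but emit no term.
            if (!a.deleted) prefix.add(a.term);
            walk(arcs, a.to, end, prefix, out);
            if (!a.deleted) prefix.remove(prefix.size() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(paths(example(), 0, 3)); // [[twd], [walking, dead]]
    }
}
```

Because the flagged arc is still there to connect node 0 to node 1, both 
paths survive, and the stopword itself never shows up in the emitted terms.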


