[ https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Woodward updated LUCENE-8717:
----------------------------------
    Attachment: LUCENE-8717.patch

> Handle stop words that appear at articulation points
> ----------------------------------------------------
>
>                 Key: LUCENE-8717
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8717
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8717.patch, LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term
> synonym starts with a stopword. This means that given a synonym file
> containing the mapping "the walking dead => twd" and a standard English
> stopword filter, QueryBuilder will produce incorrect queries.
>
> The tricky part here is that our standard way of dealing with stopwords,
> which is to just remove them entirely from the token stream and use a larger
> position increment on subsequent tokens, doesn't work when the removed token
> also has a position length greater than 1. There are various tricks you can
> do to increment position length on the previous token, but these don't work
> if the stopword is the first token in the token stream, or if there are
> multiple stopwords in the side path.
>
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only
> use on tokens that should be removed from the stream but which hold necessary
> information about the structure of the token graph. These tokens can then be
> removed by GraphTokenStreamFiniteStrings at query time, and by
> FlattenGraphFilter at index time.
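
To make the difference between the two approaches concrete, here is a rough, self-contained sketch. It is a toy model only, not the attached patch and not Lucene's API: the Token class, the deleted flag, the method names and the sample stream are all hypothetical, and the sample simply assumes a stopword that arrives with a position length of 2. It contrasts dropping a stopword and bumping the next token's position increment with keeping the token and flagging it as deleted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopwordGraphSketch {

    /** Toy token (hypothetical, not a Lucene class). */
    public static final class Token {
        public final String term;
        public final int posInc;      // position increment relative to the previous token
        public final int posLen;      // how many positions this token spans
        public final boolean deleted; // stand-in for the proposed "term deleted" marker

        public Token(String term, int posInc, int posLen, boolean deleted) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.deleted = deleted;
        }

        @Override
        public String toString() {
            return term + "(inc=" + posInc + ",len=" + posLen + (deleted ? ",deleted" : "") + ")";
        }
    }

    /** Conventional approach: drop stopwords and add their increment to the next kept token. */
    public static List<Token> removeStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stopwords.contains(t.term)) {
                skipped += t.posInc; // the removed token's posLen is discarded here
            } else {
                out.add(new Token(t.term, t.posInc + skipped, t.posLen, false));
                skipped = 0;
            }
        }
        return out;
    }

    /** Toy version of the proposal: keep every token, but flag stopwords as deleted. */
    public static List<Token> markStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.term, t.posInc, t.posLen, stopwords.contains(t.term)));
        }
        return out;
    }

    /** Hypothetical stream in which the stopword "the" spans two positions. */
    public static List<Token> sample() {
        return List.of(
            new Token("the", 1, 2, false),
            new Token("foo", 0, 1, false),
            new Token("bar", 1, 1, false),
            new Token("baz", 1, 1, false));
    }

    public static void main(String[] args) {
        Set<String> stopwords = Set.of("the");
        System.out.println("removed: " + removeStopwords(sample(), stopwords));
        System.out.println("marked:  " + markStopwords(sample(), stopwords));
    }
}
```

In the first result nothing records that a token spanning two positions ever existed; in the second, a later stage can still read the flagged token's position length before discarding it, which is the role the description above assigns to GraphTokenStreamFiniteStrings and FlattenGraphFilter.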