[jira] [Commented] (LUCENE-8717) Handle stop words that appear at articulation points

Alan Woodward (JIRA) Wed, 06 Mar 2019 02:54:04 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785496#comment-16785496
 ]


Alan Woodward commented on LUCENE-8717:
---------------------------------------

Here's a patch implementing my idea above:
 * FilteringTokenFilter now extends GraphTokenFilter and checks whether a token 
to be removed sits at an articulation point in the graph; if it does, the token 
is left in the stream and marked as deleted using a new TermDeletedAttribute
 * GraphTokenStreamFiniteStrings now caches the whole token State, rather than 
just terms and increments, so that it can detect TermDeletedAttribute and 
ignore terms marked with it when building its finite-string token streams
 * FlattenGraphFilter also detects terms marked as deleted and skips over them
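
As a rough illustration of the second point, here is a minimal, self-contained 
sketch that does not use the Lucene API. The Token class, its boolean deleted 
flag (standing in for TermDeletedAttribute) and the four-token graph are all 
assumptions made up for this example, not code from the patch. It only shows 
the idea: a token marked as deleted still contributes its edge to the graph, 
but its term is left out of the enumerated strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene code: a token is an edge between two graph nodes.
final class DeletedTokenSketch {

    static final class Token {
        final String term;
        final int from;
        final int to;
        final boolean deleted; // stand-in for the proposed TermDeletedAttribute

        Token(String term, int from, int to, boolean deleted) {
            this.term = term;
            this.from = from;
            this.to = to;
            this.deleted = deleted;
        }
    }

    // Assumed example graph: "twd" spans nodes 0..3, in parallel with
    // "the walking dead", where "the" has been marked as deleted.
    static List<Token> sampleGraph() {
        return Arrays.asList(
            new Token("twd", 0, 3, false),
            new Token("the", 0, 1, true),
            new Token("walking", 1, 2, false),
            new Token("dead", 2, 3, false));
    }

    // Enumerate the term sequence of every path from start to end.
    // Edges of deleted tokens are still followed; only their terms are omitted.
    static List<List<String>> finiteStrings(List<Token> graph, int start, int end) {
        List<List<String>> out = new ArrayList<>();
        walk(graph, start, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Token> graph, int node, int end,
                             List<String> path, List<List<String>> out) {
        if (node == end) {
            out.add(new ArrayList<>(path));
            return;
        }
        for (Token t : graph) {
            if (t.from != node) {
                continue;
            }
            if (!t.deleted) {
                path.add(t.term);
            }
            walk(graph, t.to, end, path, out);
            if (!t.deleted) {
                path.remove(path.size() - 1); // backtrack
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(finiteStrings(sampleGraph(), 0, 3));
        // prints [[twd], [walking, dead]]
    }
}
```

Because the edge for "the" is kept, the side path still connects node 0 to 
node 3, so "walking dead" comes out as a complete alternative to "twd".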

> Handle stop words that appear at articulation points
> ----------------------------------------------------
>
>                 Key: LUCENE-8717
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8717
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term 
> synonym starts with a stopword.  This means that given a synonym file 
> containing the mapping "the walking dead => twd" and a standard English 
> stopword filter, QueryBuilder will produce incorrect queries.
> The tricky part here is that our standard way of dealing with stopwords, 
> which is to just remove them entirely from the token stream and use a larger 
> position increment on subsequent tokens, doesn't work when the removed token 
> also has a position length greater than 1.  There are various tricks you can 
> do to increment position length on the previous token, but this doesn't work 
> if the stopword is the first token in the token stream, or if there are 
> multiple stopwords in the side path.
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only 
> use on tokens that should be removed from the stream but which hold necessary 
> information about the structure of the token graph.  These tokens can then be 
> removed by GraphTokenStreamFiniteStrings at query time, and by 
> FlattenGraphFilter at index time.
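
The position arithmetic in the description above can be sketched without 
Lucene. Everything in this example is an assumption for illustration: the Token 
class, the helper names and the token graph are hypothetical, and the graph is 
not claimed to be what any particular filter emits. The sketch applies the 
usual removal rule (drop the stopword, add its position increment to the next 
surviving token) first to a linear stream and then to a graph in which the 
stopword opens a multi-token side path. It does not model the case of a 
stopword whose own position length is greater than 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Lucene code: a token with a position increment and length.
final class StopwordRemovalSketch {

    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Conventional removal: drop each stopword and fold its position
    // increment into the next token that survives.
    static List<Token> removeStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stopwords.contains(t.term)) {
                skipped += t.posInc;
                continue;
            }
            out.add(new Token(t.term, t.posInc + skipped, t.posLen));
            skipped = 0;
        }
        return out;
    }

    // Describe each token as the edge it covers, "term:from->to".
    static List<String> edges(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // so that a first increment of 1 lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(t.term + ":" + pos + "->" + (pos + t.posLen));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));

        List<Token> linear = Arrays.asList(
            new Token("the", 1, 1),
            new Token("walking", 1, 1),
            new Token("dead", 1, 1));
        System.out.println(edges(removeStopwords(linear, stop)));
        // prints [walking:1->2, dead:2->3]

        List<Token> graph = Arrays.asList(
            new Token("twd", 1, 3),
            new Token("the", 0, 1),
            new Token("walking", 1, 1),
            new Token("dead", 1, 1));
        System.out.println(edges(removeStopwords(graph, stop)));
        // prints [twd:0->3, walking:1->2, dead:2->3]
    }
}
```

In the linear case the larger increment on "walking" is enough to record the 
gap. In the graph case no remaining edge leads from position 0 to position 1, 
so the "walking dead" side path is cut off from the start of the graph.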



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
