[jira] [Commented] (LUCENE-8717) Handle stop words that appear at articulation points

Alan Woodward (JIRA) Mon, 01 Apr 2019 07:35:18 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806850#comment-16806850
 ]


Alan Woodward commented on LUCENE-8717:
---------------------------------------

{{StopFilter}} extends {{FilteringTokenFilter}}, so it will handle things in 
the same way.  I've gone back and forth a bit on whether we should use 
{{TermDeletedAttribute}} all the time, or whether we can restrict it to just 
articulation points, but I think we should probably do the latter.  
Articulation points only appear in token graphs, and a whole bunch of token 
filters don't really make sense in that context, so with this change we only 
need to update a few filters.  If we extend it so that everything needs to 
understand {{TermDeletedAttribute}}, then there are many more filters that we 
need to update (for example, all the summarizing/hashing filters: do we 
include deleted terms or not here?).
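
For background, the conventional removal described in the quoted issue text 
(drop the stopword and fold its position increment into the next surviving 
token) can be sketched with a toy model.  This is not Lucene code: the 
{{Tok}} class, the method names, and the "twd" token layout are assumptions 
made up for illustration, not output captured from a real analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a token stream; hypothetical names, NOT Lucene's API. */
final class StopwordGapSketch {

    /** A token reduced to the attributes the discussion is about. */
    static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // number of positions the token spans

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Drops stopwords and adds their increments to the next kept token. */
    static List<Tok> removeStopwords(List<Tok> in, Set<String> stopwords) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (stopwords.contains(t.term)) {
                skipped += t.posInc; // remember the positions we skipped
            } else {
                out.add(new Tok(t.term, t.posInc + skipped, t.posLen));
                skipped = 0;
            }
        }
        return out;
    }

    /** Renders each token as "term:from->to" using absolute positions. */
    static List<String> arcs(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;
            out.add(t.term + ":" + pos + "->" + (pos + t.posLen));
        }
        return out;
    }
}
```

In this model, removing "the" from the linear stream the(1,1) quick(1,1) 
fox(1,1) leaves quick:1->2 and fox:2->3, so positions are preserved.  For the 
assumed graph twd(1,3) the(0,1) walking(1,1) dead(1,1), the same removal 
leaves twd:0->3, walking:1->2 and dead:2->3: the side path now starts at 
position 1, and nothing connects it back to position 0 where "twd" starts.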

For synonyms, this still doesn't quite work because of the way {{SynonymMap}} 
builds itself, but I think cutting {{SynonymGraphFilter}} over to use 
{{GraphTokenFilter}} will make it a lot easier to match multi-token inputs 
with stopwords.  That would be a separate issue though.

> Handle stop words that appear at articulation points
> ----------------------------------------------------
>
>                 Key: LUCENE-8717
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8717
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8717.patch, LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term 
> synonym starts with a stopword.  This means that given a synonym file 
> containing the mapping "the walking dead => twd" and a standard English 
> stopword filter, QueryBuilder will produce incorrect queries.
> The tricky part here is that our standard way of dealing with stopwords, 
> which is to just remove them entirely from the token stream and use a larger 
> position increment on subsequent tokens, doesn't work when the removed token 
> also has a position length greater than 1.  There are various tricks you can 
> do to increment position length on the previous token, but this doesn't work 
> if the stopword is the first token in the token stream, or if there are 
> multiple stopwords in the side path.
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only 
> use on tokens that should be removed from the stream but which hold necessary 
> information about the structure of the token graph.  These tokens can then be 
> removed by GraphTokenStreamFiniteStrings at query time, and by 
> FlattenGraphFilter at index time.
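
The proposal quoted above (flag the token as deleted but keep it in the 
stream so the graph structure survives) can be illustrated with a toy path 
enumeration.  This is only a sketch under stated assumptions: the {{Tok}} 
class, its {{deleted}} flag, and {{paths}} are hypothetical stand-ins, not the 
actual TermDeletedAttribute or GraphTokenStreamFiniteStrings code, and the 
token layout is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of "flag, don't remove"; hypothetical names, NOT Lucene's API. */
final class DeletedFlagSketch {

    static final class Tok {
        final String term;
        final int posInc;      // position increment relative to the previous token
        final int posLen;      // number of positions the token spans
        final boolean deleted; // stand-in for the proposed deleted marker

        Tok(String term, int posInc, int posLen, boolean deleted) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.deleted = deleted;
        }
    }

    /** Lists every path from position 0 to the end, omitting deleted terms. */
    static List<String> paths(List<Tok> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < n; i++) {
            Tok t = tokens.get(i);
            pos += t.posInc;
            from[i] = pos;
            to[i] = pos + t.posLen;
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, 0, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Tok> tokens, int[] from, int[] to, int node,
                             int end, List<String> terms, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", terms));
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] != node) {
                continue;
            }
            Tok t = tokens.get(i);
            // A deleted token still provides the arc, but not its term.
            if (!t.deleted) {
                terms.add(t.term);
            }
            walk(tokens, from, to, to[i], end, terms, out);
            if (!t.deleted) {
                terms.remove(terms.size() - 1);
            }
        }
    }
}
```

With twd(1,3) the(0,1,deleted) walking(1,1) dead(1,1), {{paths}} returns 
"twd" and "walking dead".  If "the" is removed outright instead, leaving 
twd(1,3) walking(1,1) dead(1,1), only "twd" is returned, because no arc 
leaves position 0 toward the side path.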



