[ https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789544#comment-16789544 ]
Alan Woodward commented on LUCENE-8717:
---------------------------------------
I updated the patch to add support in GraphTokenFilter; now calling
`incrementBaseToken`, `incrementGraph` and `incrementGraphToken` will all do
the right thing and ignore tokens that have been deleted. I also added a test
to FixedShingleFilterTest to demonstrate how this gets passed through to
implementing filters.
> We'd need to change all consumers of TokenStreams
Only those consumers that expect graphs, though? And those should all be using
GraphTokenStreamFiniteStrings anyway. I think that GraphTokenFilter helps
immensely here as well. I should open a separate issue to try using it in
SynonymGraphFilter so that it can consume incoming graphs. Basically, any
TokenFilter that reads ahead in the token stream should be using
GraphTokenFilter now.
> Handle stop words that appear at articulation points
> ----------------------------------------------------
>
> Key: LUCENE-8717
> URL: https://issues.apache.org/jira/browse/LUCENE-8717
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8717.patch, LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term
> synonym starts with a stopword. This means that given a synonym file
> containing the mapping "the walking dead => twd" and a standard English
> stopword filter, QueryBuilder will produce incorrect queries.
> The tricky part here is that our standard way of dealing with stopwords,
> which is to just remove them entirely from the token stream and use a larger
> position increment on subsequent tokens, doesn't work when the removed token
> also has a position length greater than 1. There are various tricks you can
> do to increment position length on the previous token, but this doesn't work
> if the stopword is the first token in the token stream, or if there are
> multiple stopwords in the side path.
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only
> use on tokens that should be removed from the stream but which hold necessary
> information about the structure of the token graph. These tokens can then be
> removed by GraphTokenStreamFiniteStrings at query time, and by
> FlattenGraphFilter at index time.
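
To make the failure mode in the quoted description concrete, here is a minimal
sketch in plain Java. It does not use Lucene: `Token`, `dropStopwords`,
`markStopwords` and `paths` are hypothetical names invented for illustration,
and the `deleted` flag only stands in for the proposed TermDeletedAttribute.
It assumes a token stream for "the walking dead" with a single-token
alternative "twd" spanning all three positions, and it assumes every position
length is at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class DeletedTokenSketch {

    /** Toy token, NOT a Lucene class: term, position increment, position length, deleted flag. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;
        final boolean deleted;

        Token(String term, int posInc, int posLen, boolean deleted) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.deleted = deleted;
        }
    }

    /** Hypothetical input: "twd" spans positions 0..3, alongside "the walking dead". */
    static List<Token> example() {
        return List.of(
            new Token("twd", 1, 3, false),
            new Token("the", 0, 1, false),
            new Token("walking", 1, 1, false),
            new Token("dead", 1, 1, false));
    }

    /** Conventional removal: drop the stopword and add its increment to the next kept token. */
    static List<Token> dropStopwords(List<Token> in, Set<String> stop) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stop.contains(t.term)) {
                skipped += t.posInc;
                continue;
            }
            out.add(new Token(t.term, t.posInc + skipped, t.posLen, false));
            skipped = 0;
        }
        return out;
    }

    /** Alternative: keep the stopword in the stream, but flag it as deleted. */
    static List<Token> markStopwords(List<Token> in, Set<String> stop) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.term, t.posInc, t.posLen, stop.contains(t.term)));
        }
        return out;
    }

    /** Enumerates term paths from position 0 to the last end position, omitting deleted terms. */
    static List<List<String>> paths(List<Token> tokens) {
        int[] starts = new int[tokens.size()];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).posInc;
            starts[i] = pos;
            end = Math.max(end, pos + tokens.get(i).posLen);
        }
        List<List<String>> result = new ArrayList<>();
        walk(tokens, starts, 0, end, new ArrayList<>(), result);
        return result;
    }

    private static void walk(List<Token> tokens, int[] starts, int node, int end,
                             List<String> prefix, List<List<String>> result) {
        if (node == end) {
            result.add(new ArrayList<>(prefix));
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (starts[i] != node) {
                continue;
            }
            Token t = tokens.get(i);
            if (!t.deleted) {
                prefix.add(t.term);
            }
            walk(tokens, starts, node + t.posLen, end, prefix, result);
            if (!t.deleted) {
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the");
        System.out.println(paths(dropStopwords(example(), stop))); // prints [[twd]]
        System.out.println(paths(markStopwords(example(), stop))); // prints [[twd], [walking, dead]]
    }
}
```

In this toy model, dropping "the" leaves "walking" starting at a position that
no remaining token leads into, so only the "twd" path can be walked. Keeping
the token with a deleted flag preserves both paths while still omitting the
stopword's term. How GraphTokenStreamFiniteStrings and FlattenGraphFilter
actually handle such tokens is defined by the patch, not by this sketch.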