Alan Woodward created LUCENE-8916:
-------------------------------------

             Summary: GraphTokenStreamFiniteStrings.FiniteStringsTokenStream does not play well with subsequent TokenFilters
                 Key: LUCENE-8916
                 URL: https://issues.apache.org/jira/browse/LUCENE-8916
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Alan Woodward
            Assignee: Alan Woodward


GraphTokenStreamFiniteStrings provides a view over the multiple paths through a 
token graph, which is useful when building queries over synonyms of differing 
lengths.  This view is exposed as an iterator over simple TokenStreams.  
However, these TokenStreams do not work correctly when further wrapped in token 
filters, because they do not use a CharTermAttribute.
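
To illustrate what "multiple paths through a token graph" means, here is a 
minimal sketch in plain Java.  It does not use Lucene; the Edge class, the 
paths() helper and the "fast wifi network" example are all hypothetical and 
only model the idea of enumerating each path as its own sequence of terms.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGraphPaths {
    /** Hypothetical edge: a term leading from one graph state to another. */
    static final class Edge {
        final int from;
        final int to;
        final String term;

        Edge(int from, int to, String term) {
            this.from = from;
            this.to = to;
            this.term = term;
        }
    }

    /** Returns every term sequence leading from state 0 to the given end state. */
    static List<List<String>> paths(List<Edge> edges, int end) {
        List<List<String>> out = new ArrayList<>();
        collect(edges, 0, end, new ArrayList<>(), out);
        return out;
    }

    // Depth-first walk; assumes the graph is acyclic.
    private static void collect(List<Edge> edges, int state, int end,
                                List<String> prefix, List<List<String>> out) {
        if (state == end) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (Edge e : edges) {
            if (e.from == state) {
                prefix.add(e.term);
                collect(edges, e.to, end, prefix, out);
                prefix.remove(prefix.size() - 1); // backtrack
            }
        }
    }

    /** Made-up example: "wifi" with the two-term synonym "wi fi". */
    static List<Edge> demoEdges() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(0, 1, "fast"));
        edges.add(new Edge(1, 3, "wifi"));
        edges.add(new Edge(1, 2, "wi"));
        edges.add(new Edge(2, 3, "fi"));
        edges.add(new Edge(3, 4, "network"));
        return edges;
    }

    public static void main(String[] args) {
        // Prints "fast wifi network", then "fast wi fi network".
        for (List<String> path : paths(demoEdges(), 4)) {
            System.out.println(String.join(" ", path));
        }
    }
}
```

Each returned list corresponds to one path, which is roughly the role each 
simple TokenStream plays in the iterator described above.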

For an example of the issues this can cause, see 
https://github.com/elastic/elasticsearch/issues/43976, where Elasticsearch uses 
a special shingle field to speed up phrase searches.  Queries are converted to 
shingles if they have multiple terms.  However, if the query resolves into a 
graph due to synonyms, this conversion breaks, because the FixedShingleFilter 
is given a token stream built by GraphTokenStreamFiniteStrings.  Terms are set 
using BytesTermAttribute but then read using CharTermAttribute, and as these 
have different backing implementations, FixedShingleFilter ends up emitting 
null tokens.
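
The mismatch can be modelled with a small, self-contained Java sketch.  This 
is a toy model, not Lucene code: ToyBytesTerm, ToyCharTerm and ToyTokenState 
are hypothetical stand-ins for the two term attributes and the state shared 
between a producer and a downstream filter.  In this toy the reader sees an 
empty term rather than the null tokens reported for FixedShingleFilter, and 
emitAsChars() is only an assumed remedy, not a fix stated in this issue.

```java
import java.nio.charset.StandardCharsets;

public class AttributeMismatchSketch {
    /** Toy stand-in for a bytes-backed term attribute (not Lucene's class). */
    static final class ToyBytesTerm {
        byte[] bytes = new byte[0];
    }

    /** Toy stand-in for a char-backed term attribute (not Lucene's class). */
    static final class ToyCharTerm {
        final StringBuilder chars = new StringBuilder();
    }

    /** Toy shared state that a producer and a downstream filter both see. */
    static final class ToyTokenState {
        final ToyBytesTerm bytesTerm = new ToyBytesTerm();
        final ToyCharTerm charTerm = new ToyCharTerm();
    }

    // Producer that only fills the bytes-backed attribute.
    static void emitAsBytes(ToyTokenState state, String term) {
        state.bytesTerm.bytes = term.getBytes(StandardCharsets.UTF_8);
    }

    // Producer that fills the char-backed attribute (assumed remedy).
    static void emitAsChars(ToyTokenState state, String term) {
        state.charTerm.chars.setLength(0);
        state.charTerm.chars.append(term);
    }

    // Downstream filter that only ever reads the char-backed attribute.
    static String readTerm(ToyTokenState state) {
        return state.charTerm.chars.toString();
    }

    public static void main(String[] args) {
        ToyTokenState state = new ToyTokenState();

        emitAsBytes(state, "wifi");
        System.out.println("[" + readTerm(state) + "]"); // prints "[]"

        emitAsChars(state, "wifi");
        System.out.println("[" + readTerm(state) + "]"); // prints "[wifi]"
    }
}
```

The point of the sketch is that writing a term into one attribute never 
updates the other, so a filter that reads only the char-backed attribute 
never sees the term.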



