[ https://issues.apache.org/jira/browse/LUCENE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948157#comment-16948157 ]
Chongchen Chen commented on LUCENE-8137:
----------------------------------------

In QueryBuilder.analyzeGraphBoolean, a GraphTokenStreamFiniteStrings is first created from the input token stream. GraphTokenStreamFiniteStrings builds an automaton from the graph and then trims it. In the test case, the graph before trimming is:

```
    guinea
0 -------> 1        2
|___________________|
         cavy
# 2 is the finish state.
```

State 1 cannot reach the finish state, so it is trimmed. The code is in GraphTokenStreamFiniteStrings.java:

```java
public GraphTokenStreamFiniteStrings(TokenStream in) throws IOException {
  Automaton aut = build(in);
  // removeDeadStates drops states that cannot reach an accept state,
  // which is where state 1 is lost.
  this.det = Operations.removeDeadStates(
      Operations.determinize(aut, DEFAULT_MAX_DETERMINIZED_STATES));
}
```

> GraphTokenStreamFiniteStrings does not handle position inc > 1 in multi-word synonyms
> ------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8137
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8137
>             Project: Lucene - Core
>          Issue Type: Bug
>   Affects Versions: 7.2.1, 8.0
>            Reporter: Jim Ferenczi
>            Assignee: Jim Ferenczi
>            Priority: Major
>         Attachments: SGF_SF_interaction.patch
>
>
> The automaton built for graph queries that contain multiple multi-word
> synonyms does not handle gaps if they appear in the middle of a multi-word
> synonym. In such cases the token next to the gap is considered part of the
> multi-word synonym.
> Stop words that appear before or after multi-word synonyms are handled
> correctly in the current version, but the synonym rule "part of speech, pos",
> for instance, does not create the expected query if "of" is removed by a
> filter that is set after the synonym_graph. One solution would be to reuse
> TokenStreamToAutomaton (with minor changes to add the ability to create token
> transitions rather than chars), which preserves gaps (as a transition) in the
> produced automaton.
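The trimming described in the comment above can be illustrated without Lucene. The following is a minimal sketch, not Lucene's implementation: `DeadStateSketch`, `Edge`, `liveStates`, and `trim` are hypothetical names introduced here, and the sketch only checks whether a state can reach the finish state (it does not also check reachability from the initial state, and it does no determinization).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; none of these types are Lucene classes.
final class DeadStateSketch {

    /** A labelled edge between two states. */
    static final class Edge {
        final int from;
        final String label;
        final int to;

        Edge(int from, String label, int to) {
            this.from = from;
            this.label = label;
            this.to = to;
        }
    }

    /** Returns the states from which the accept state can be reached. */
    static Set<Integer> liveStates(List<Edge> edges, int accept) {
        Set<Integer> live = new TreeSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        live.add(accept);
        work.push(accept);
        // Walk the edges backwards, starting from the accept state.
        while (!work.isEmpty()) {
            int state = work.pop();
            for (Edge e : edges) {
                if (e.to == state && live.add(e.from)) {
                    work.push(e.from);
                }
            }
        }
        return live;
    }

    /** Keeps only the edges whose endpoints are both live. */
    static List<Edge> trim(List<Edge> edges, int accept) {
        Set<Integer> live = liveStates(edges, accept);
        List<Edge> kept = new ArrayList<>();
        for (Edge e : edges) {
            if (live.contains(e.from) && live.contains(e.to)) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The graph from the comment: 0 -guinea-> 1, 0 -cavy-> 2, finish state 2.
        List<Edge> graph = List.of(new Edge(0, "guinea", 1), new Edge(0, "cavy", 2));
        System.out.println("live states: " + liveStates(graph, 2));
        for (Edge e : trim(graph, 2)) {
            System.out.println("kept edge: " + e.from + " -" + e.label + "-> " + e.to);
        }
    }
}
```

In this sketch state 1 has no outgoing edge, so it never becomes live and the "guinea" edge is dropped, which mirrors the behaviour the comment attributes to the trimming step.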