Hi all,

I have been trying to solve an issue where FlattenGraphFilter (FGF) removes tokens produced by WordDelimiterGraphFilter (WDGF). As a result, searches that contain the contraction "can't" do not match.
This is on Solr version 7.7.1. The field in question is defined as follows:

    <field name="myField" type="text_general" indexed="true" stored="true"/>

And the relevant fieldType "text_general":

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

Finally, the relevant entries in synonyms.txt are:

    can,cans
    cants,cant

Using the Solr console Analysis and "can't" as the Field Value, the following tokens are produced (find the verbose output at the bottom of this email):

Index

    ST    | can't
    SF    | can't
    WDGF  | cant  | can't | can   | t
    FGF   | cant  | can't | can   | t
    SGF   | cants | cant  | can't |  | cans | can | t
    ICUFF | cants | cant  | can't |  | cans | can | t
    FGF   | cants | cant  | can't |  | t

Query

    ST    | can't
    SF    | can't
    WDGF  | can | t
    SF    | can | t
    ICUFF | can | t

As you can see after the FGF the tokens "can" and "cans" are pruned so the query does not match. Is there a reasonable way to preserve these tokens?

My key concern is that I want the "fix" for this to have as little impact on other queries as possible.

Some things I have checked/tried:

Searching for similar problems I found this thread:
https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
Here it is suggested that FGF is not necessary (without any supporting evidence). This goes directly against the documentation, which states "If you use [the SynonymGraphFilter] during indexing, you must follow it with a Flatten Graph Filter":
https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
Despite this warning I tried removing the FGF on a local cluster, and indeed it still runs and this search now works; however, I am paranoid that this will break far more things than it fixes.

I have tried adding the FGF as a filter to the query analyzer. This does not eliminate the "can" term in the query analysis.

I have tested other contracted words. Some have this issue as well, others do not: "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all preserve their tokens; "won't" does not. I believe the pattern here is that whenever part of the contraction has synonyms, this problem manifests.

Eliminating WDGF is not viable, as we rely on this functionality for other uses of delimiters (such as wi-fi -> wi fi).

Performing WDGF after synonyms is also not viable: if we have the data "historical-text", we want it to match the search "history text".

The hacky solution I have found is to use the PatternReplaceFilterFactory to replace "can't" with "cant". Though this technically solves the issue, I hope it is obvious why this does not feel like an ideal solution.

Has anyone encountered this type of issue before? Any advice on how the filter use here could be improved to handle this case?

Thanks,
Eric Buss

PS.
The verbose output from Analysis of "can't"

Index

ST    | text           | can't            |
      | raw_bytes      | [63 61 6e 27 74] |
      | start          | 0                |
      | end            | 5                |
      | positionLength | 1                |
      | type           | <ALPHANUM>       |
      | termFrequency  | 1                |
      | position       | 1                |
SF    | text           | can't            |
      | raw_bytes      | [63 61 6e 27 74] |
      | start          | 0                |
      | end            | 5                |
      | positionLength | 1                |
      | type           | <ALPHANUM>       |
      | termFrequency  | 1                |
      | position       | 1                |
WDGF  | text           | cant          | can't            | can        | t          |
      | raw_bytes      | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
      | start          | 0             | 0                | 0          | 4          |
      | end            | 5             | 5                | 3          | 5          |
      | positionLength | 2             | 2                | 1          | 1          |
      | type           | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency  | 1             | 1                | 1          | 1          |
      | position       | 1             | 1                | 1          | 2          |
      | keyword        | false         | false            | false      | false      |
FGF   | text           | cant          | can't            | can        | t          |
      | raw_bytes      | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
      | start          | 0             | 0                | 0          | 4          |
      | end            | 5             | 5                | 3          | 5          |
      | positionLength | 2             | 2                | 1          | 1          |
      | type           | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency  | 1             | 1                | 1          | 1          |
      | position       | 1             | 1                | 1          | 2          |
      | keyword        | false         | false            | false      | false      |
SGF   | text           | cants            | cant          | can't            | cans          | can        | t          |
      | raw_bytes      | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
      | start          | 0                | 0             | 0                | 0             | 0          | 4          |
      | end            | 5                | 5             | 5                | 3             | 3          | 5          |
      | positionLength | 1                | 1             | 2                | 1             | 1          | 1          |
      | type           | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency  | 1                | 1             | 1                | 1             | 1          | 1          |
      | position       | 1                | 1             | 1                | 3             | 3          | 4          |
      | keyword        | false            | false         | false            | false         | false      | false      |
FGF   | text           | cants            | cant          | can't            | t          |
      | raw_bytes      | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
      | start          | 0                | 0             | 0                | 4          |
      | end            | 5                | 5             | 5                | 5          |
      | positionLength | 1                | 1             | 1                | 1          |
      | type           | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
      | termFrequency  | 1                | 1             | 1                | 1          |
      | position       | 1                | 1             | 1                | 3          |
      | keyword        | false            | false         | false            | false      |
ICUFF | text           | cants            | cant          | can't            | t          |
      | raw_bytes      | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
      | start          | 0                | 0             | 0                | 4          |
      | end            | 5                | 5             | 5                | 5          |
      | positionLength | 1                | 1             | 1                | 1          |
      | type           | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
      | termFrequency  | 1                | 1             | 1                | 1          |
      | position       | 1                | 1             | 1                | 3          |
      | keyword        | false            | false         | false            | false      |

Query

ST    | text           | can't            |
      | raw_bytes      | [63 61 6e 27 74] |
      | start          | 0                |
      | end            | 5                |
      | positionLength | 1                |
      | type           | <ALPHANUM>       |
      | termFrequency  | 1                |
      | position       | 1                |
SF    | text           | can't            |
      | raw_bytes      | [63 61 6e 27 74] |
      | start          | 0                |
      | end            | 5                |
      | positionLength | 1                |
      | type           | <ALPHANUM>       |
      | termFrequency  | 1                |
      | position       | 1                |
WDGF  | text           | can        | t          |
      | raw_bytes      | [63 61 6e] | [74]       |
      | start          | 0          | 4          |
      | end            | 3          | 5          |
      | positionLength | 1          | 1          |
      | type           | <ALPHANUM> | <ALPHANUM> |
      | termFrequency  | 1          | 1          |
      | position       | 1          | 2          |
      | keyword        | false      | false      |
SF    | text           | can        | t          |
      | raw_bytes      | [63 61 6e] | [74]       |
      | start          | 0          | 4          |
      | end            | 3          | 5          |
      | positionLength | 1          | 1          |
      | type           | <ALPHANUM> | <ALPHANUM> |
      | termFrequency  | 1          | 1          |
      | position       | 1          | 2          |
      | keyword        | false      | false      |
ICUFF | text           | can        | t          |
      | raw_bytes      | [63 61 6e] | [74]       |
      | start          | 0          | 4          |
      | end            | 3          | 5          |
      | positionLength | 1          | 1          |
      | type           | <ALPHANUM> | <ALPHANUM> |
      | termFrequency  | 1          | 1          |
      | position       | 1          | 2          |
      | keyword        | false      | false      |
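
For reference, the PatternReplaceFilterFactory workaround mentioned earlier could be sketched roughly as follows. This is a minimal illustration and not the exact configuration in use: the regex and the placement (immediately before WordDelimiterGraphFilterFactory, in both the index and query analyzers of text_general) are assumptions.

```xml
<!-- Sketch only. Assumed placement: directly before
     solr.WordDelimiterGraphFilterFactory in both analyzers, so that WDGF
     never sees the apostrophe and emits the single token "cant". -->
<!-- The pattern is a regex applied to each token; the anchors restrict the
     rewrite to tokens that are exactly "can't" (case-sensitive as written). -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^can't$" replacement="cant" replace="all"/>
```

A rule like this covers only the one contraction; every other affected contraction (for example "won't") would need its own rule, which is a large part of why this does not feel like a real fix.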