Hi all,

I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
tokens produced by WordDelimiterGraphFilter (WDGF). As a result, searches that
contain the contraction "can't" do not match.

This is on Solr version 7.7.1.

The field in question is defined as follows:

<field name="myField" type="text_general" indexed="true" stored="true"/>

And the relevant fieldType "text_general":

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1"
                splitOnCaseChange="0"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    </analyzer>
</fieldType>

Finally, the relevant entries in synonyms.txt are:

can,cans
cants,cant

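Using the Analysis screen of the Solr admin console with "can't" as the Field
Value, the following tokens are produced at each stage (ST = StandardTokenizer,
SF = StopFilter, SGF = SynonymGraphFilter, ICUFF = ICUFoldingFilter; the verbose
output is at the bottom of this email):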

Index
ST    | can't
SF    | can't
WDGF  | cant | can't | can | t
FGF   | cant | can't | can | t
SGF   | cants | cant | can't | | cans | can | t
FGF   | cants | cant | can't | | t
ICUFF | cants | cant | can't | | t

Query
ST    | can't
SF    | can't
WDGF  | can | t
SF    | can | t
ICUFF | can | t

As you can see, after the second FGF the tokens "can" and "cans" are pruned, so
the query does not match. Is there a reasonable way to preserve these tokens?

My key concern is that I want the "fix" for this to have as little impact on 
other queries as possible.

Some things I have checked/tried:

Searching for similar problems I found this thread: 
https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
Here it is suggested that FGF is not necessary (without any supporting 
evidence). This goes directly against the documentation that states "If you use 
[the SynonymGraphFilter] during indexing, you must follow it with a Flatten 
Graph Filter":
https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
Despite this warning, I tried removing the FGF on a local cluster. It still
runs and this search now works, but I am paranoid that this will break far more
things than it fixes.

I have tried adding the FGF as a filter to the query analyzer. This does not
eliminate the "can" term in the query analysis.
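
A sketch of that attempt, for concreteness. The position of the FGF in the
chain (directly after WDGF) is an assumption made for illustration; the rest
is the query analyzer from above:

<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <!-- Assumed placement: FGF added directly after WDGF -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>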

I have tested other contracted words. Some have this issue as well; others do
not. "haven't", "shouldn't", "couldn't", "I'll", "weren't" and "ain't" all
preserve their tokens, while "won't" does not. I believe the pattern is that
this problem manifests whenever part of the contraction has synonyms.

Eliminating WDGF is not viable as we rely on this functionality for other uses
of delimiters (such as wi-fi -> wi fi).

Performing WDGF after synonyms is also not viable: when the indexed data
contains "historical-text", we want it to match the search "history text".

The hacky solution I have found is to use the PatternReplaceFilterFactory to
replace "can't" with "cant". Though this technically solves the issue, I hope it
is obvious why this does not feel like an ideal solution.
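
Roughly, that workaround looks like the sketch below. The exact pattern and the
placement (before WDGF, in both the index and query analyzers) are illustrative
assumptions, and as written it only covers the lowercase, straight-apostrophe
spelling:

<!-- Sketch only: rewrite the whole token "can't" to "cant" before WDGF can
     split it on the apostrophe -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^can't$" replacement="cant" replace="all"/>

Every other affected contraction (for example "won't") would need its own rule
of this kind.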

Has anyone encountered this type of issue before? Any advice on how the filter 
use here could be improved to handle this case?

Thanks,
Eric Buss


PS. The verbose output from Analysis of "can't"

Index

ST    | text          | can't            | 
      | raw_bytes     | [63 61 6e 27 74] | 
      | start         | 0                | 
      | end           | 5                | 
      | positionLength| 1                | 
      | type          | <ALPHANUM>       | 
      | termFrequency | 1                | 
      | position      | 1                | 
SF    | text          | can't            | 
      | raw_bytes     | [63 61 6e 27 74] | 
      | start         | 0                | 
      | end           | 5                | 
      | positionLength| 1                | 
      | type          | <ALPHANUM>       | 
      | termFrequency | 1                | 
      | position      | 1                | 
WDGF  | text          | cant          | can't            | can        | t          |
      | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
      | start         | 0             | 0                | 0          | 4          |
      | end           | 5             | 5                | 3          | 5          |
      | positionLength| 2             | 2                | 1          | 1          |
      | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1             | 1                | 1          | 1          |
      | position      | 1             | 1                | 1          | 2          |
      | keyword       | false         | false            | false      | false      |
FGF   | text          | cant          | can't            | can        | t          |
      | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
      | start         | 0             | 0                | 0          | 4          |
      | end           | 5             | 5                | 3          | 5          |
      | positionLength| 2             | 2                | 1          | 1          |
      | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1             | 1                | 1          | 1          |
      | position      | 1             | 1                | 1          | 2          |
      | keyword       | false         | false            | false      | false      |
SGF   | text          | cants            | cant          | can't            | cans          | can        | t          |
      | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
      | start         | 0                | 0             | 0                | 0             | 0          | 4          |
      | end           | 5                | 5             | 5                | 3             | 3          | 5          |
      | positionLength| 1                | 1             | 2                | 1             | 1          | 1          |
      | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
      | termFrequency | 1                | 1             | 1                | 1             | 1          | 1          |
      | position      | 1                | 1             | 1                | 3             | 3          | 4          |
      | keyword       | false            | false         | false            | false         | false      | false      |
FGF   | text          | cants            | cant          | can't            | t          |
      | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
      | start         | 0                | 0             | 0                | 4          |
      | end           | 5                | 5             | 5                | 5          |
      | positionLength| 1                | 1             | 1                | 1          |
      | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
      | termFrequency | 1                | 1             | 1                | 1          |
      | position      | 1                | 1             | 1                | 3          |
      | keyword       | false            | false         | false            | false      |
ICUFF | text          | cants            | cant          | can't            | t          |
      | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
      | start         | 0                | 0             | 0                | 4          |
      | end           | 5                | 5             | 5                | 5          |
      | positionLength| 1                | 1             | 1                | 1          |
      | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
      | termFrequency | 1                | 1             | 1                | 1          |
      | position      | 1                | 1             | 1                | 3          |
      | keyword       | false            | false         | false            | false      |

Query

ST    | text          | can't            | 
      | raw_bytes     | [63 61 6e 27 74] | 
      | start         | 0                | 
      | end           | 5                | 
      | positionLength| 1                | 
      | type          | <ALPHANUM>       | 
      | termFrequency | 1                | 
      | position      | 1                | 
SF    | text          | can't            | 
      | raw_bytes     | [63 61 6e 27 74] | 
      | start         | 0                | 
      | end           | 5                | 
      | positionLength| 1                | 
      | type          | <ALPHANUM>       | 
      | termFrequency | 1                | 
      | position      | 1                | 
WDGF  | text          | can        | t          | 
      | raw_bytes     | [63 61 6e] | [74]       | 
      | start         | 0          | 4          | 
      | end           | 3          | 5          | 
      | positionLength| 1          | 1          | 
      | type          | <ALPHANUM> | <ALPHANUM> | 
      | termFrequency | 1          | 1          | 
      | position      | 1          | 2          | 
      | keyword       | false      | false      | 
SF    | text          | can        | t          | 
      | raw_bytes     | [63 61 6e] | [74]       | 
      | start         | 0          | 4          | 
      | end           | 3          | 5          | 
      | positionLength| 1          | 1          | 
      | type          | <ALPHANUM> | <ALPHANUM> | 
      | termFrequency | 1          | 1          | 
      | position      | 1          | 2          | 
      | keyword       | false      | false      | 
ICUFF | text          | can        | t          | 
      | raw_bytes     | [63 61 6e] | [74]       | 
      | start         | 0          | 4          | 
      | end           | 3          | 5          | 
      | positionLength| 1          | 1          | 
      | type          | <ALPHANUM> | <ALPHANUM> | 
      | termFrequency | 1          | 1          | 
      | position      | 1          | 2          | 
      | keyword       | false      | false      |
