Thanks for the reply,

I wouldn't be surprised if the issue you linked is related. I also found
another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723

You are absolutely right that the FlattenGraphFilter should only be used once, 
but as you noted the issue I am experiencing seems unrelated.
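
For reference, here is a rough, untested sketch of the index analyzer with a
single FlattenGraphFilter placed after the last graph-emitting filter. It only
drops the first FGF from the chain quoted below and assumes everything else
stays the same. As far as I know, SynonymGraphFilter is not guaranteed to
handle graph input from WDGF correctly, so this is an illustration of the
suggestion rather than a verified fix.

```xml
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Single FGF, after all filters that emit graphs (WDGF and SGF) -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```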

On 2019-12-05, 10:23 AM, "Michael Gibney" <mich...@michaelgibney.net> wrote:

    I wonder if this might be similar/related to the underlying problem
    that is intended to be addressed by
    https://issues.apache.org/jira/browse/LUCENE-8985?
    
    btw, I think you only want to use FlattenGraphFilter *once* in the
    indexing analysis chain, towards the end (after all components that
    emit graphs). ...though that's probably *not* what's causing the
    problem (based on the fact that the extra FGF doesn't seem to modify
    any attributes).
    
    
    
    On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <ericb...@abebooks.com> wrote:
    >
    > Hi all,
    >
    > I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
    > tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches that
    > contain the contraction "can't" do not match.
    >
    > This is on Solr version 7.7.1.
    >
    > The field in question is defined as follows:
    >
    > <field name="myField" type="text_general" indexed="true" stored="true"/>
    >
    > And the relevant fieldType "text_general":
    >
    > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    >     <analyzer type="index">
    >         <tokenizer class="solr.StandardTokenizerFactory"/>
    >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    >         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    >         <filter class="solr.FlattenGraphFilterFactory"/>
    >         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    >         <filter class="solr.FlattenGraphFilterFactory"/>
    >         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    >     </analyzer>
    >     <analyzer type="query">
    >         <tokenizer class="solr.StandardTokenizerFactory"/>
    >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    >         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
    >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    >         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
    >     </analyzer>
    > </fieldType>
    >
    > Finally, the relevant entries in synonyms.txt are:
    >
    > can,cans
    > cants,cant
    >
    > Using the Solr console Analysis and "can't" as the Field Value, the following
    > tokens are produced (find the verbose output at the bottom of this email):
    >
    > Index
    > ST    | can't
    > SF    | can't
    > WDGF  | cant | can't | can | t
    > FGF   | cant | can't | can | t
    > SGF   | cants | cant | can't | | cans | can | t
    > FGF   | cants | cant | can't | | t
    > ICUFF | cants | cant | can't | | t
    >
    > Query
    > ST    | can't
    > SF    | can't
    > WDGF  | can | t
    > SF    | can | t
    > ICUFF | can | t
    >
    > As you can see, after the second FGF the tokens "can" and "cans" are pruned, so
    > the query does not match. Is there a reasonable way to preserve these tokens?
    >
    > My key concern is that I want the "fix" for this to have as little impact on
    > other queries as possible.
    >
    > Some things I have checked/tried:
    >
    > Searching for similar problems I found this thread:
    > https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
    > Here it is suggested that FGF is not necessary (without any supporting
    > evidence). This goes directly against the documentation that states "If you use
    > [the SynonymGraphFilter] during indexing, you must follow it with a Flatten
    > Graph Filter":
    > https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
    > Despite this warning, I tried removing the FGF on a local cluster. It still
    > runs and this search now works. However, I am paranoid that this will break
    > far more things than it fixes.
    >
    > I have tried adding the FGF as a filter to the query analyzer. This does not
    > eliminate the "can" term in the query analysis.
    >
    > I have tested other contracted words. Some have this issue as well; others do
    > not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", and "ain't" all
    > preserve their tokens; "won't" does not. I believe the pattern here is that
    > this problem manifests whenever part of the contraction has synonyms.
    >
    > Eliminating WDGF is not viable as we rely on this functionality for other uses
    > of delimiters (such as wi-fi -> wi fi).
    >
    > Performing WDGF after synonyms is also not viable: when we have the data
    > "historical-text", we want it to match the search "history text".
    >
    > The hacky solution I have found is to use the PatternReplaceFilterFactory to
    > replace "can't" with "cant". Though this technically solves the issue, I hope it
    > is obvious why this does not feel like an ideal solution.
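    >
    > A minimal sketch of that workaround, assuming the filter sits directly before
    > WDGF in both analyzers (the exact pattern and placement here are illustrative
    > assumptions; as written it is case-sensitive and covers only this one
    > contraction):
    >
    > ```xml
    > <!-- Rewrite the whole token "can't" to "cant" before WDGF can split it -->
    > <filter class="solr.PatternReplaceFilterFactory" pattern="^can't$" replacement="cant" replace="all"/>
    > ```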
    >
    > Has anyone encountered this type of issue before? Any advice on how the filter
    > use here could be improved to handle this case?
    >
    > Thanks,
    > Eric Buss
    >
    >
    > PS. The verbose output from Analysis of "can't"
    >
    > Index
    >
    > ST    | text          | can't            |
    >       | raw_bytes     | [63 61 6e 27 74] |
    >       | start         | 0                |
    >       | end           | 5                |
    >       | positionLength| 1                |
    >       | type          | <ALPHANUM>       |
    >       | termFrequency | 1                |
    >       | position      | 1                |
    > SF    | text          | can't            |
    >       | raw_bytes     | [63 61 6e 27 74] |
    >       | start         | 0                |
    >       | end           | 5                |
    >       | positionLength| 1                |
    >       | type          | <ALPHANUM>       |
    >       | termFrequency | 1                |
    >       | position      | 1                |
    > WDGF  | text          | cant          | can't            | can        | t          |
    >       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
    >       | start         | 0             | 0                | 0          | 4          |
    >       | end           | 5             | 5                | 3          | 5          |
    >       | positionLength| 2             | 2                | 1          | 1          |
    >       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1             | 1                | 1          | 1          |
    >       | position      | 1             | 1                | 1          | 2          |
    >       | keyword       | false         | false            | false      | false      |
    > FGF   | text          | cant          | can't            | can        | t          |
    >       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
    >       | start         | 0             | 0                | 0          | 4          |
    >       | end           | 5             | 5                | 3          | 5          |
    >       | positionLength| 2             | 2                | 1          | 1          |
    >       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1             | 1                | 1          | 1          |
    >       | position      | 1             | 1                | 1          | 2          |
    >       | keyword       | false         | false            | false      | false      |
    > SGF   | text          | cants            | cant          | can't            | cans          | can        | t          |
    >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
    >       | start         | 0                | 0             | 0                | 0             | 0          | 4          |
    >       | end           | 5                | 5             | 5                | 3             | 3          | 5          |
    >       | positionLength| 1                | 1             | 2                | 1             | 1          | 1          |
    >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1                | 1             | 1                | 1             | 1          | 1          |
    >       | position      | 1                | 1             | 1                | 3             | 3          | 4          |
    >       | keyword       | false            | false         | false            | false         | false      | false      |
    > FGF   | text          | cants            | cant          | can't            | t          |
    >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
    >       | start         | 0                | 0             | 0                | 4          |
    >       | end           | 5                | 5             | 5                | 5          |
    >       | positionLength| 1                | 1             | 1                | 1          |
    >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
    >       | termFrequency | 1                | 1             | 1                | 1          |
    >       | position      | 1                | 1             | 1                | 3          |
    >       | keyword       | false            | false         | false            | false      |
    > ICUFF | text          | cants            | cant          | can't            | t          |
    >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
    >       | start         | 0                | 0             | 0                | 4          |
    >       | end           | 5                | 5             | 5                | 5          |
    >       | positionLength| 1                | 1             | 1                | 1          |
    >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
    >       | termFrequency | 1                | 1             | 1                | 1          |
    >       | position      | 1                | 1             | 1                | 3          |
    >       | keyword       | false            | false         | false            | false      |
    >
    > Query
    >
    > ST    | text          | can't            |
    >       | raw_bytes     | [63 61 6e 27 74] |
    >       | start         | 0                |
    >       | end           | 5                |
    >       | positionLength| 1                |
    >       | type          | <ALPHANUM>       |
    >       | termFrequency | 1                |
    >       | position      | 1                |
    > SF    | text          | can't            |
    >       | raw_bytes     | [63 61 6e 27 74] |
    >       | start         | 0                |
    >       | end           | 5                |
    >       | positionLength| 1                |
    >       | type          | <ALPHANUM>       |
    >       | termFrequency | 1                |
    >       | position      | 1                |
    > WDGF  | text          | can        | t          |
    >       | raw_bytes     | [63 61 6e] | [74]       |
    >       | start         | 0          | 4          |
    >       | end           | 3          | 5          |
    >       | positionLength| 1          | 1          |
    >       | type          | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1          | 1          |
    >       | position      | 1          | 2          |
    >       | keyword       | false      | false      |
    > SF    | text          | can        | t          |
    >       | raw_bytes     | [63 61 6e] | [74]       |
    >       | start         | 0          | 4          |
    >       | end           | 3          | 5          |
    >       | positionLength| 1          | 1          |
    >       | type          | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1          | 1          |
    >       | position      | 1          | 2          |
    >       | keyword       | false      | false      |
    > ICUFF | text          | can        | t          |
    >       | raw_bytes     | [63 61 6e] | [74]       |
    >       | start         | 0          | 4          |
    >       | end           | 3          | 5          |
    >       | positionLength| 1          | 1          |
    >       | type          | <ALPHANUM> | <ALPHANUM> |
    >       | termFrequency | 1          | 1          |
    >       | position      | 1          | 2          |
    >       | keyword       | false      | false      |
    >
    
