Hi Michael,

> I think you only want to use FlattenGraphFilter *once* in the indexing
> analysis chain


I did exactly this for a long time before I finally switched to adding an FGF
after every graph filter factory. I don't know much about this at the code
level, but are you sure that all the subsequent filters can consume a graph
if we don't put an FGF directly after a graph filter factory?
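
For concreteness, the single-FGF arrangement would look roughly like the
sketch below, adapted from Eric's index analyzer quoted further down. The only
change is dropping the FGF between WDGF and the synonym filter. This is an
illustration of the layout under discussion, not a verified fix: whether
SynonymGraphFilter correctly consumes WDGF's unflattened graph is exactly the
open question.

```xml
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
    <!-- No FlattenGraphFilter here: the synonym filter receives WDGF's graph output directly. -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- Single FlattenGraphFilter, placed after the last graph-emitting filter. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
</analyzer>
```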

On Fri, 6 Dec 2019 at 01:22, Eric Buss <ericb...@abebooks.com> wrote:

> Thanks for the reply,
>
> I wouldn't be surprised if the issue you linked is related, I also found
> another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723
>
> You are absolutely right that the FlattenGraphFilter should only be used
> once, but as you noted the issue I am experiencing seems unrelated.
>
> On 2019-12-05, 10:23 AM, "Michael Gibney" <mich...@michaelgibney.net>
> wrote:
>
>     I wonder if this might be similar/related to the underlying problem
>     that is intended to be addressed by
>     https://issues.apache.org/jira/browse/LUCENE-8985?
>
>     btw, I think you only want to use FlattenGraphFilter *once* in the
>     indexing analysis chain, towards the end (after all components that
>     emit graphs). ...though that's probably *not* what's causing the
>     problem (based on the fact that the extra FGF doesn't seem to modify
>     any attributes).
>
>
>
>     On Mon, Nov 25, 2019 at 2:19 PM Eric Buss <ericb...@abebooks.com> wrote:
>     >
>     > Hi all,
>     >
>     > I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
>     > tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches
>     > that contain the contraction "can't" do not match.
>     >
>     > This is on Solr version 7.7.1.
>     >
>     > The field in question is defined as follows:
>     >
>     > <field name="myField" type="text_general" indexed="true" stored="true"/>
>     >
>     > And the relevant fieldType "text_general":
>     >
>     > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>     >     <analyzer type="index">
>     >         <tokenizer class="solr.StandardTokenizerFactory"/>
>     >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     >         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
>     >         <filter class="solr.FlattenGraphFilterFactory"/>
>     >         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     >         <filter class="solr.FlattenGraphFilterFactory"/>
>     >         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     >     </analyzer>
>     >     <analyzer type="query">
>     >         <tokenizer class="solr.StandardTokenizerFactory"/>
>     >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     >         <filter class="solr.WordDelimiterGraphFilterFactory" stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" splitOnCaseChange="0"/>
>     >         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     >         <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     >     </analyzer>
>     > </fieldType>
>     >
>     > Finally, the relevant entries in synonyms.txt are:
>     >
>     > can,cans
>     > cants,cant
>     >
>     > Using the Solr console Analysis and "can't" as the Field Value, the following
>     > tokens are produced (find the verbose output at the bottom of this email):
>     >
>     > Index
>     > ST    | can't
>     > SF    | can't
>     > WDGF  | cant | can't | can | t
>     > FGF   | cant | can't | can | t
>     > SGF   | cants | cant | can't | | cans | can | t
>     > FGF   | cants | cant | can't | | t
>     > ICUFF | cants | cant | can't | | t
>     >
>     > Query
>     > ST    | can't
>     > SF    | can't
>     > WDGF  | can | t
>     > SF    | can | t
>     > ICUFF | can | t
>     >
>     > As you can see, after the FGF the tokens "can" and "cans" are pruned, so the
>     > query does not match. Is there a reasonable way to preserve these tokens?
>     >
>     > My key concern is that I want the "fix" for this to have as little impact on
>     > other queries as possible.
>     >
>     > Some things I have checked/tried:
>     >
>     > Searching for similar problems I found this thread:
>     > https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
>     > Here it is suggested that FGF is not necessary (without any supporting
>     > evidence). This goes directly against the documentation that states "If you
>     > use [the SynonymGraphFilter] during indexing, you must follow it with a
>     > Flatten Graph Filter":
>     > https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
>     > Despite this warning I tried removing the FGF on a local cluster, and indeed
>     > it still runs and this search now works. However, I am paranoid that this
>     > will break far more things than it fixes.
>     >
>     > I have tried adding the FGF as a filter to the query. This does not eliminate
>     > the "can" term in the query analysis.
>     >
>     > I have tested other contracted words. Some have this issue as well; others do
>     > not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
>     > preserve their tokens; "won't" does not. I believe the pattern here is that
>     > whenever part of the contraction has synonyms this problem manifests.
>     >
>     > Eliminating WDGF is not viable as we rely on this functionality for other uses
>     > of delimiters (such as wi-fi -> wi fi).
>     >
>     > Performing WDGF after synonyms is also not viable as in the case that we have
>     > the data "historical-text" we want this to match the search "history text".
>     >
>     > The hacky solution I have found is to use the PatternReplaceFilterFactory to
>     > replace "can't" with "cant". Though this technically solves the issue, I hope
>     > it is obvious why this does not feel like an ideal solution.
>     >
>     > Has anyone encountered this type of issue before? Any advice on how the filter
>     > use here could be improved to handle this case?
>     >
>     > Thanks,
>     > Eric Buss
>     >
>     >
>     > PS. The verbose output from Analysis of "can't"
>     >
>     > Index
>     >
>     > ST    | text          | can't            |
>     >       | raw_bytes     | [63 61 6e 27 74] |
>     >       | start         | 0                |
>     >       | end           | 5                |
>     >       | positionLength| 1                |
>     >       | type          | <ALPHANUM>       |
>     >       | termFrequency | 1                |
>     >       | position      | 1                |
>     > SF    | text          | can't            |
>     >       | raw_bytes     | [63 61 6e 27 74] |
>     >       | start         | 0                |
>     >       | end           | 5                |
>     >       | positionLength| 1                |
>     >       | type          | <ALPHANUM>       |
>     >       | termFrequency | 1                |
>     >       | position      | 1                |
>     > WDGF  | text          | cant          | can't            | can        | t          |
>     >       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>     >       | start         | 0             | 0                | 0          | 4          |
>     >       | end           | 5             | 5                | 3          | 5          |
>     >       | positionLength| 2             | 2                | 1          | 1          |
>     >       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>     >       | termFrequency | 1             | 1                | 1          | 1          |
>     >       | position      | 1             | 1                | 1          | 2          |
>     >       | keyword       | false         | false            | false      | false      |
>     > FGF   | text          | cant          | can't            | can        | t          |
>     >       | raw_bytes     | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]       |
>     >       | start         | 0             | 0                | 0          | 4          |
>     >       | end           | 5             | 5                | 3          | 5          |
>     >       | positionLength| 2             | 2                | 1          | 1          |
>     >       | type          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> | <ALPHANUM> |
>     >       | termFrequency | 1             | 1                | 1          | 1          |
>     >       | position      | 1             | 1                | 1          | 2          |
>     >       | keyword       | false         | false            | false      | false      |
>     > SGF   | text          | cants            | cant          | can't            | cans          | can        | t          |
>     >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e 73] | [63 61 6e] | [74]       |
>     >       | start         | 0                | 0             | 0                | 0             | 0          | 4          |
>     >       | end           | 5                | 5             | 5                | 3             | 3          | 5          |
>     >       | positionLength| 1                | 1             | 2                | 1             | 1          | 1          |
>     >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | SYNONYM       | <ALPHANUM> | <ALPHANUM> |
>     >       | termFrequency | 1                | 1             | 1                | 1             | 1          | 1          |
>     >       | position      | 1                | 1             | 1                | 3             | 3          | 4          |
>     >       | keyword       | false            | false         | false            | false         | false      | false      |
>     > FGF   | text          | cants            | cant          | can't            | t          |
>     >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>     >       | start         | 0                | 0             | 0                | 4          |
>     >       | end           | 5                | 5             | 5                | 5          |
>     >       | positionLength| 1                | 1             | 1                | 1          |
>     >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>     >       | termFrequency | 1                | 1             | 1                | 1          |
>     >       | position      | 1                | 1             | 1                | 3          |
>     >       | keyword       | false            | false         | false            | false      |
>     > ICUFF | text          | cants            | cant          | can't            | t          |
>     >       | raw_bytes     | [63 61 6e 74 73] | [63 61 6e 74] | [63 61 6e 27 74] | [74]       |
>     >       | start         | 0                | 0             | 0                | 4          |
>     >       | end           | 5                | 5             | 5                | 5          |
>     >       | positionLength| 1                | 1             | 1                | 1          |
>     >       | type          | SYNONYM          | <ALPHANUM>    | <ALPHANUM>       | <ALPHANUM> |
>     >       | termFrequency | 1                | 1             | 1                | 1          |
>     >       | position      | 1                | 1             | 1                | 3          |
>     >       | keyword       | false            | false         | false            | false      |
>     >
>     > Query
>     >
>     > ST    | text          | can't            |
>     >       | raw_bytes     | [63 61 6e 27 74] |
>     >       | start         | 0                |
>     >       | end           | 5                |
>     >       | positionLength| 1                |
>     >       | type          | <ALPHANUM>       |
>     >       | termFrequency | 1                |
>     >       | position      | 1                |
>     > SF    | text          | can't            |
>     >       | raw_bytes     | [63 61 6e 27 74] |
>     >       | start         | 0                |
>     >       | end           | 5                |
>     >       | positionLength| 1                |
>     >       | type          | <ALPHANUM>       |
>     >       | termFrequency | 1                |
>     >       | position      | 1                |
>     > WDGF  | text          | can        | t          |
>     >       | raw_bytes     | [63 61 6e] | [74]       |
>     >       | start         | 0          | 4          |
>     >       | end           | 3          | 5          |
>     >       | positionLength| 1          | 1          |
>     >       | type          | <ALPHANUM> | <ALPHANUM> |
>     >       | termFrequency | 1          | 1          |
>     >       | position      | 1          | 2          |
>     >       | keyword       | false      | false      |
>     > SF    | text          | can        | t          |
>     >       | raw_bytes     | [63 61 6e] | [74]       |
>     >       | start         | 0          | 4          |
>     >       | end           | 3          | 5          |
>     >       | positionLength| 1          | 1          |
>     >       | type          | <ALPHANUM> | <ALPHANUM> |
>     >       | termFrequency | 1          | 1          |
>     >       | position      | 1          | 2          |
>     >       | keyword       | false      | false      |
>     > ICUFF | text          | can        | t          |
>     >       | raw_bytes     | [63 61 6e] | [74]       |
>     >       | start         | 0          | 4          |
>     >       | end           | 3          | 5          |
>     >       | positionLength| 1          | 1          |
>     >       | type          | <ALPHANUM> | <ALPHANUM> |
>     >       | termFrequency | 1          | 1          |
>     >       | position      | 1          | 2          |
>     >       | keyword       | false      | false      |
>     >
>
>
>

-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
