Hi all,

I've done some more research on this problem and, from what I understand,
phrase queries and stopwords are known not to work well together in some
circumstances.

E.g.,
https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
https://issues.apache.org/jira/browse/SOLR-6468
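
To make the mismatch concrete, here is a rough sketch using the positions
from the Analysis page output in my original message (quoted below). It is
not how Lucene actually evaluates phrase queries. It only checks that every
query term sits at the same relative position in the index, which is
roughly what an exact phrase match (slop 0) needs. The class and method
names are made up, each term is assumed to occur only once, and the
positions for the shorter query are my assumption, following the same
pattern as the longer one.

```java
import java.util.Map;

final class PhrasePositionCheck {
    // Final index-side positions, copied from the Analysis page output.
    static final Map<String, Integer> INDEXED =
        Map.of("molecular", 1, "cloning", 2, "evolution", 4, "genes", 6);

    // Final query-side positions for the phrase ending in "genes".
    static final Map<String, Integer> FULL_QUERY =
        Map.of("molecular", 1, "cloning", 2, "evolution", 4, "genes", 7);

    // Assumed positions for the shorter phrase (not shown in the gist).
    static final Map<String, Integer> SHORT_QUERY =
        Map.of("molecular", 1, "cloning", 2, "evolution", 4);

    /**
     * Returns true when every query term is present in the index and all
     * terms are shifted by the same amount relative to the query positions.
     */
    static boolean exactPhraseMatches(Map<String, Integer> indexed,
                                      Map<String, Integer> query) {
        boolean first = true;
        int shift = 0;
        for (Map.Entry<String, Integer> term : query.entrySet()) {
            Integer indexedPos = indexed.get(term.getKey());
            if (indexedPos == null) {
                return false; // term is not in the document at all
            }
            int delta = indexedPos - term.getValue();
            if (first) {
                shift = delta;
                first = false;
            } else if (delta != shift) {
                return false; // relative positions disagree
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(exactPhraseMatches(INDEXED, SHORT_QUERY)); // true
        System.out.println(exactPhraseMatches(INDEXED, FULL_QUERY));  // false
    }
}
```

The longer phrase fails only because "genes" is at position 6 in the index
but at position 7 in the query.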

I've been thinking about workarounds, but none of the solutions I've
attempted quite works.

Therefore, one possible solution might be to take a step back and
preprocess the index/query data before it goes to Solr, something like:

String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
// wordsForSolr = "pretend index query data"

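Off the top of my head, this should bypass the position issues, because the
stopwords would be gone before analysis and so would never leave gaps in
the token positions.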
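
As a rough sketch of what removeStopWordsFrom could look like (untested
against our real data): the class name is made up, the stopword list is the
one from my original message below, and it splits on whitespace only, so it
will not catch stopwords joined by the other characters our tokenizer
splits on (e.g. "-" or "/").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordPreprocessor {
    // Same words as our stopwords.txt (see the quoted message).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with", "which"));

    /** Drops stopwords (case-insensitively) and keeps the other words in order. */
    static String removeStopWordsFrom(String text) {
        List<String> kept = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            // Compare in lower case, mirroring ignoreCase="true" in the StopFilter.
            if (!word.isEmpty() && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // prints "pretend index query data"
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
    }
}
```

The same function would have to be applied to both the indexed text and the
query string, otherwise the two sides would disagree again.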

I will give this a go, but I was wondering whether others have done
something similar?

Best wishes,
Edd

--------------------
Edward Turner


On Fri, 6 Nov 2020 at 13:58, Edward Turner <eddtur...@gmail.com> wrote:

> Hi all,
>
> We are experiencing some unexpected behaviour for phrase queries which we
> believe might be related to the FlattenGraphFilterFactory and stopwords.
>
> Brief description: when performing a phrase query
> "Molecular cloning and evolution of the" => we get expected hits
> "Molecular cloning and evolution of the genes" => we get no hits
> (unexpected behaviour)
>
> I think it's worthwhile adding the analyzers we use to help you see what
> we're doing:
> ------------ Analyzers ----------------
> <fieldType name="full_ci" class="solr.TextField"
>    sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>    <analyzer type="index">
>       <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>          pattern="[- /()]+" />
>       <filter class="solr.StopFilterFactory" words="stopwords.txt"
>          ignoreCase="true" />
>       <filter class="solr.ASCIIFoldingFilterFactory"
>          preserveOriginal="false" />
>       <filter class="solr.LowerCaseFilterFactory" />
>       <filter class="solr.WordDelimiterGraphFilterFactory"
>          generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>          splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
>          catenateNumbers="0" catenateWords="1" catenateAll="1" />
>       <filter class="solr.FlattenGraphFilterFactory" />
>    </analyzer>
>    <analyzer type="query">
>       <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>          pattern="[- /()]+" />
>       <filter class="solr.StopFilterFactory" words="stopwords.txt"
>          ignoreCase="true" />
>       <filter class="solr.ASCIIFoldingFilterFactory"
>          preserveOriginal="false" />
>       <filter class="solr.LowerCaseFilterFactory" />
>       <filter class="solr.WordDelimiterGraphFilterFactory"
>          generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>          splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
>          catenateNumbers="0" catenateWords="0" catenateAll="0" />
>    </analyzer>
> </fieldType>
> ------------ End of Analyzers ----------------
>
> ------------ Stopwords ----------------
> We use the following stopwords:
> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
> of, on, or, such, that, the, their, then, there, these, they, this, to,
> was, will, with, which
> ------------ End of Stopwords ----------------
>
> ------------ Analysis Admin page output ---------------
> ... And to show what's going on when we're indexing/querying, I created a
> gist with an image of the (non-verbose) output of the analysis admin page
> for the index data/query "Molecular cloning and evolution of the genes":
>
> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>
> Hopefully this link works, and you can see that the resulting terms and
> positions are identical until the FlattenGraphFilterFactory step in the
> "index" phase.
>
> Final stage of index analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>
> Final stage of query analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>
> The empty positions are because of stopwords (presumably)
> ------------ End of Analysis Admin page output ---------------
>
> Main question:
> Could someone explain why the FlattenGraphFilterFactory changes the
> position of the "genes" token? From what we see, this happens after a
> "the" (but we've not checked exhaustively, and continue to test).
>
> Perhaps, we are doing something wrong in our analysis setup?
>
> Any help would be much appreciated -- getting phrase queries to work is an
> important use-case of ours.
>
> Kind regards and thank you in advance,
> Edd
> --------------------
> Edward Turner
>
