Hi all,

Okay, I've been doing more research into this problem and, from what I understand, phrase queries and stopwords are known not to work well together in some circumstances.
E.g.,
https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
https://issues.apache.org/jira/browse/SOLR-6468

I was thinking about workarounds, but none of the solutions I've attempted quite works. So maybe one possible solution is to take a step back and preprocess the index/query data before it goes to Solr, something like:

    String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
    // wordsForSolr = "pretend index query data"

Off the top of my head, this will bypass the position issues. I will give this a go, but was wondering whether this is something others have done?

Best wishes,
Edd
--------------------
Edward Turner

On Fri, 6 Nov 2020 at 13:58, Edward Turner <eddtur...@gmail.com> wrote:

> Hi all,
>
> We are experiencing some unexpected behaviour for phrase queries which we
> believe might be related to the FlattenGraphFilterFactory and stopwords.
>
> Brief description: when performing a phrase query
> "Molecular cloning and evolution of the" => we get expected hits
> "Molecular cloning and evolution of the genes" => we get no hits
> (unexpected behaviour)
>
> I think it's worthwhile adding the analyzers we use to help you see what
> we're doing:
> ------------ Analyzers ----------------
> <fieldType name="full_ci" class="solr.TextField"
>            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>                pattern="[- /()]+" />
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             ignoreCase="true" />
>     <filter class="solr.ASCIIFoldingFilterFactory"
>             preserveOriginal="false" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>             splitOnNumerics="0" stemEnglishPossessive="1"
>             generateWordParts="1"
>             catenateNumbers="0" catenateWords="1" catenateAll="1" />
>     <filter class="solr.FlattenGraphFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
>                pattern="[- /()]+" />
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
>             ignoreCase="true" />
>     <filter class="solr.ASCIIFoldingFilterFactory"
>             preserveOriginal="false" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>             splitOnNumerics="0" stemEnglishPossessive="1"
>             generateWordParts="1"
>             catenateNumbers="0" catenateWords="0" catenateAll="0" />
>   </analyzer>
> </fieldType>
> ------------ End of Analyzers ----------------
>
> ------------ Stopwords ----------------
> We use the following stopwords:
> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
> of, on, or, such, that, the, their, then, there, these, they, this, to,
> was, will, with, which
> ------------ End of Stopwords ----------------
>
> ------------ Analysis Admin page output ---------------
> ... And to see what's going on when we're indexing/querying, I created a
> gist with an image of the (non-verbose) output of the analysis admin page
> for, index data/query, "Molecular cloning and evolution of the genes":
>
> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>
> Hopefully this link works, and you can see that the resulting terms and
> positions are identical until the FlattenGraphFilterFactory step in the
> "index" phase.
>
> Final stage of index analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>
> Final stage of query analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>
> The empty positions are because of stopwords (presumably)
> ------------ End of Analysis Admin page output ---------------
>
> Main question:
> Could someone explain why the FlattenGraphFilterFactory changes the
> position of the "genes" token?
> From what we see, this happens after a
> "the" (but we've not checked exhaustively, and continue to test).
>
> Perhaps we are doing something wrong in our analysis setup?
>
> Any help would be much appreciated -- getting phrase queries to work is an
> important use-case of ours.
>
> Kind regards and thank you in advance,
> Edd
> --------------------
> Edward Turner
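
For illustration, the preprocessing idea at the top of this message (removeStopWordsFrom) could be sketched in Java roughly as follows. This is only a sketch under stated assumptions, not a tested fix: the class name StopWordPreprocessor is hypothetical, the stop-word set is copied from the list quoted above, and tokens are split with the same pattern as the SimplePatternSplitTokenizerFactory in the quoted field type. The same function would have to be applied to both the index data and the query phrases, otherwise the positions would presumably diverge again.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr: strips stop words before text is sent to Solr.
final class StopWordPreprocessor {

    // Stop words copied from the list quoted in this thread (assumed to match stopwords.txt).
    private static final Set<String> STOP_WORDS = Set.of(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with", "which");

    // Splits on the same delimiter pattern as the quoted tokenizer config,
    // drops stop words case-insensitively, and rejoins the rest with single spaces.
    public static String removeStopWordsFrom(String text) {
        return Arrays.stream(text.split("[- /()]+"))
            .filter(token -> !token.isEmpty())
            .filter(token -> !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT)))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints "pretend index query data"
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
        // prints "Molecular cloning evolution genes"
        System.out.println(removeStopWordsFrom("Molecular cloning and evolution of the genes"));
    }
}
```

Note that because the surviving tokens are rejoined with single spaces, any hyphens, slashes and parentheses in the original text are lost, which may or may not matter depending on whether the field is also stored for display.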