Many thanks, Walter, that's useful information. And yes, if we are able to keep stopwords, then we will. We have been exploring stopword removal because we've noticed that it leads to a sizable drop in index size (5% in some of our tests), which then had the knock-on effect of better performance. (Also, unfortunately, we do not have the luxury of using super big machines/storage -- so it's always a balancing act for us.)

Best,
Edd
--------------------
Edward Turner

On Tue, 10 Nov 2020 at 16:22, Walter Underwood <wun...@wunderwood.org> wrote:

> By far the simplest solution is to leave stopwords in the index. That also
> improves relevance, because it becomes possible to search for “vitamin a”
> or “to be or not to be”.
>
> Stopword remove was a performance and disk space hack from the 1960s. It is
> no longer needed. We were keeping stopwords in the index at Infoseek, back
> in 1996.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Nov 10, 2020, at 1:16 AM, Edward Turner <eddtur...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Okay, I've been doing more research about this problem and from what I
> > understand, phrase queries + stopwords are known to have some difficulties
> > working together in some circumstances.
> >
> > E.g.,
> > https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> > https://issues.apache.org/jira/browse/SOLR-6468
> >
> > I was thinking about workarounds, but each solution I've attempted doesn't
> > quite work.
> >
> > Therefore, maybe one possible solution is to take a step back and
> > preprocess index/query data going to Solr, something like:
> >
> > String wordsForSolr = removeStopWordsFrom("This is pretend index or query data")
> > // wordsForSolr = "pretend index query data"
> >
> > Off the top of my head, this will by-pass position issues.
> >
> > I will give this a go, but was wondering whether this is something others
> > have done?
> >
> > Best wishes,
> > Edd
> >
> > --------------------
> > Edward Turner
> >
> >
> > On Fri, 6 Nov 2020 at 13:58, Edward Turner <eddtur...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> We are experiencing some unexpected behaviour for phrase queries which we
> >> believe might be related to the FlattenGraphFilterFactory and stopwords.
> >>
> >> Brief description: when performing a phrase query
> >> "Molecular cloning and evolution of the" => we get expected hits
> >> "Molecular cloning and evolution of the genes" => we get no hits
> >> (unexpected behaviour)
> >>
> >> I think it's worthwhile adding the analyzers we use to help you see what
> >> we're doing:
> >> ------------ Analyzers ----------------
> >> <fieldType name="full_ci" class="solr.TextField"
> >>            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
> >>                pattern="[- /()]+" />
> >>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
> >>             ignoreCase="true" />
> >>     <filter class="solr.ASCIIFoldingFilterFactory"
> >>             preserveOriginal="false" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
> >>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
> >>             catenateNumbers="0" catenateWords="1" catenateAll="1" />
> >>     <filter class="solr.FlattenGraphFilterFactory" />
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
> >>                pattern="[- /()]+" />
> >>     <filter class="solr.StopFilterFactory" words="stopwords.txt"
> >>             ignoreCase="true" />
> >>     <filter class="solr.ASCIIFoldingFilterFactory"
> >>             preserveOriginal="false" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
> >>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
> >>             catenateNumbers="0" catenateWords="0" catenateAll="0" />
> >>   </analyzer>
> >> </fieldType>
> >> ------------ End of Analyzers ----------------
> >>
> >> ------------ Stopwords ----------------
> >> We use the following stopwords:
> >> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
> >> of, on, or, such, that, the, their, then, there, these, they, this, to,
> >> was, will, with, which
> >> ------------ End of Stopwords ----------------
> >>
> >> ------------ Analysis Admin page output ---------------
> >> ... And to see what's going on when we're indexing/querying, I created a
> >> gist with an image of the (non-verbose) output of the analysis admin page
> >> for, index data/query, "Molecular cloning and evolution of the genes":
> >>
> >> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
> >>
> >> Hopefully this link works, and you can see that the resulting terms and
> >> positions are identical until the FlattenGraphFilterFactory step in the
> >> "index" phase.
> >>
> >> Final stage of index analysis:
> >> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
> >>
> >> Final stage of query analysis:
> >> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
> >>
> >> The empty positions are because of stopwords (presumably)
> >> ------------ End of Analysis Admin page output ---------------
> >>
> >> Main question:
> >> Could someone explain why the FlattenGraphFilterFactory changes the
> >> position of the "genes" token? From what we see, this happens after a,
> >> "the" (but we've not checked exhaustively, and continue to test).
> >>
> >> Perhaps, we are doing something wrong in our analysis setup?
> >>
> >> Any help would be much appreciated -- getting phrase queries to work is an
> >> important use-case of ours.
> >>
> >> Kind regards and thank you in advance,
> >> Edd
> >> --------------------
> >> Edward Turner
> >>
> >
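
For reference, the client-side preprocessing idea from the 10 Nov message (removeStopWordsFrom) might look roughly like the sketch below. This is only an illustration under stated assumptions, not code from the thread: the class name StopwordPreprocessor is hypothetical, the stopword set is copied from the list quoted in the 6 Nov message, and the input is split on whitespace only, which is simpler than the "[- /()]+" pattern used by the Solr tokenizer in the quoted field type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: strips stopwords from text before it is sent to Solr,
// for both documents being indexed and phrase queries.
final class StopwordPreprocessor {

    // Stopword list as quoted in the 6 Nov message.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with", "which"));

    // Assumption: whitespace-only splitting. Words joined by '-', '/' or
    // parentheses are NOT separated here, unlike in the Solr tokenizer.
    static String removeStopWordsFrom(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                // Case-insensitive match, mirroring ignoreCase="true".
                .filter(word -> !STOPWORDS.contains(word.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Prints: pretend index query data
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
    }
}
```

Because the stopwords are dropped before Solr sees the text, no position gaps should be left behind for them, provided the same preprocessing is applied to both the indexed data and the queries. Walter's point still stands, though: phrases such as "vitamin a" or "to be or not to be" become unsearchable with this approach.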