Many thanks, Walter, that's useful information. And yes, if we are able to
keep stopwords, then we will. We have been exploring stopword removal
because we've noticed that it leads to a sizable drop in index size (5% in
some of our tests), which then had the knock-on effect of better
performance. (Also, unfortunately, we do not have the luxury of using super
big machines/storage -- so it's always a balancing act for us.)

Best,
Edd
--------------------
Edward Turner


On Tue, 10 Nov 2020 at 16:22, Walter Underwood <wun...@wunderwood.org>
wrote:

> By far the simplest solution is to leave stopwords in the index. That also
> improves relevance, because it becomes possible to search for “vitamin a”
> or “to be or not to be”.
>
> Stopword removal was a performance and disk space hack from the 1960s. It
> is no longer needed. We were keeping stopwords in the index at Infoseek,
> back in 1996.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 10, 2020, at 1:16 AM, Edward Turner <eddtur...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Okay, I've been doing more research about this problem and from what I
> > understand, phrase queries + stopwords are known to have some
> > difficulties working together in some circumstances.
> >
> > E.g.,
> >
> > https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> > https://issues.apache.org/jira/browse/SOLR-6468
> >
> > I was thinking about workarounds, but each solution I've attempted
> > doesn't quite work.
> >
> > Therefore, maybe one possible solution is to take a step back and
> > preprocess index/query data going to Solr, something like:
> >
> > String wordsForSolr =
> >     removeStopWordsFrom("This is pretend index or query data");
> > // wordsForSolr = "pretend index query data"
> >
> > Off the top of my head, this will bypass the position issues.
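For anyone curious, here is a minimal sketch of what such a preprocessing step might look like in plain Java. The class name StopwordPreprocessor is made up for illustration, the stopword set is copied from the list further down this thread, and splitting on whitespace is an assumption (the Solr tokenizer in our config splits on the pattern [- /()]+ instead), so treat it as a rough outline rather than what we actually run:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class StopwordPreprocessor {
    // Stopwords as listed in the original message of this thread.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with", "which"));

    // Drops stopwords (case-insensitively) before the text is sent to Solr.
    // Assumption: words are separated by whitespace only.
    static String removeStopWordsFrom(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(word -> !word.isEmpty())
            .filter(word -> !STOPWORDS.contains(word.toLowerCase(Locale.ROOT)))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints: pretend index query data
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
    }
}
```

The same function would have to be applied to both the indexed text and the query text, so that Solr never sees a stopword on either side and no position gaps are created.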
> >
> > I will give this a go, but was wondering whether this is something others
> > have done?
> >
> > Best wishes,
> > Edd
> >
> > --------------------
> > Edward Turner
> >
> >
> > On Fri, 6 Nov 2020 at 13:58, Edward Turner <eddtur...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> We are experiencing some unexpected behaviour for phrase queries which
> >> we believe might be related to the FlattenGraphFilterFactory and
> >> stopwords.
> >>
> >> Brief description: when performing a phrase query
> >> "Molecular cloning and evolution of the" => we get expected hits
> >> "Molecular cloning and evolution of the genes" => we get no hits
> >> (unexpected behaviour)
> >>
> >> I think it's worthwhile adding the analyzers we use to help you see what
> >> we're doing:
> >> ------------ Analyzers ----------------
> >> <fieldType name="full_ci" class="solr.TextField"
> >>   sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>      <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
> >>         pattern="[- /()]+" />
> >>      <filter class="solr.StopFilterFactory" words="stopwords.txt"
> >>         ignoreCase="true" />
> >>      <filter class="solr.ASCIIFoldingFilterFactory"
> >>         preserveOriginal="false" />
> >>      <filter class="solr.LowerCaseFilterFactory" />
> >>      <filter class="solr.WordDelimiterGraphFilterFactory"
> >>         generateNumberParts="1" splitOnCaseChange="0"
> >>         preserveOriginal="0" splitOnNumerics="0"
> >>         stemEnglishPossessive="1" generateWordParts="1"
> >>         catenateNumbers="0" catenateWords="1" catenateAll="1" />
> >>      <filter class="solr.FlattenGraphFilterFactory" />
> >>   </analyzer>
> >>   <analyzer type="query">
> >>      <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
> >>         pattern="[- /()]+" />
> >>      <filter class="solr.StopFilterFactory" words="stopwords.txt"
> >>         ignoreCase="true" />
> >>      <filter class="solr.ASCIIFoldingFilterFactory"
> >>         preserveOriginal="false" />
> >>      <filter class="solr.LowerCaseFilterFactory" />
> >>      <filter class="solr.WordDelimiterGraphFilterFactory"
> >>         generateNumberParts="1" splitOnCaseChange="0"
> >>         preserveOriginal="0" splitOnNumerics="0"
> >>         stemEnglishPossessive="1" generateWordParts="1"
> >>         catenateNumbers="0" catenateWords="0" catenateAll="0" />
> >>   </analyzer>
> >> </fieldType>
> >> ------------ End of Analyzers ----------------
> >>
> >> ------------ Stopwords ----------------
> >> We use the following stopwords:
> >> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
> >> not, of, on, or, such, that, the, their, then, there, these, they, this,
> >> to, was, will, with, which
> >> ------------ End of Stopwords ----------------
> >>
> >> ------------ Analysis Admin page output ---------------
> >> ... And to see what's going on when we're indexing/querying, I created a
> >> gist with an image of the (non-verbose) output of the analysis admin
> >> page for the index data/query "Molecular cloning and evolution of the
> >> genes":
> >>
> >> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
> >>
> >> Hopefully this link works, and you can see that the resulting terms and
> >> positions are identical until the FlattenGraphFilterFactory step in the
> >> "index" phase.
> >>
> >> Final stage of index analysis:
> >> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
> >>
> >> Final stage of query analysis:
> >> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
> >>
> >> The empty positions are because of stopwords (presumably)
> >> ------------ End of Analysis Admin page output ---------------
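To make the mismatch above concrete, here is a toy model of exact phrase matching over the positions listed. It is a simplification under stated assumptions (each term occurs once, slop 0, positions taken from the analysis output above), not Lucene's actual PhraseQuery code, and the class and method names are invented for illustration:

```java
import java.util.Map;

class PhrasePositionCheck {
    // Simplified exact-phrase check: every query term must sit at the same
    // offset from its query position in the index. Assumes one occurrence
    // per term, which is enough for the example in this thread.
    static boolean exactPhraseMatches(Map<String, Integer> indexed,
                                      Map<String, Integer> query) {
        boolean first = true;
        int shift = 0;
        for (Map.Entry<String, Integer> term : query.entrySet()) {
            Integer indexPos = indexed.get(term.getKey());
            if (indexPos == null) {
                return false; // term not indexed at all
            }
            int delta = indexPos - term.getValue();
            if (first) {
                shift = delta;
                first = false;
            } else if (delta != shift) {
                return false; // relative positions disagree
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Final index-time positions reported by the analysis page.
        Map<String, Integer> indexed =
            Map.of("molecular", 1, "cloning", 2, "evolution", 4, "genes", 6);
        // Query-time positions for the two phrase queries.
        Map<String, Integer> shortQuery =
            Map.of("molecular", 1, "cloning", 2, "evolution", 4);
        Map<String, Integer> longQuery =
            Map.of("molecular", 1, "cloning", 2, "evolution", 4, "genes", 7);

        System.out.println(exactPhraseMatches(indexed, shortQuery)); // true
        System.out.println(exactPhraseMatches(indexed, longQuery));  // false
    }
}
```

Under this model the shorter phrase lines up, while "genes" is one position off between index (6) and query (7), which is consistent with the longer phrase query returning no hits.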
> >>
> >> Main question:
> >> Could someone explain why the FlattenGraphFilterFactory changes the
> >> position of the "genes" token? From what we see, this happens after a
> >> "the" (but we've not checked exhaustively, and continue to test).
> >>
> >> Perhaps we are doing something wrong in our analysis setup?
> >>
> >> Any help would be much appreciated -- getting phrase queries to work is
> >> an important use-case of ours.
> >>
> >> Kind regards and thank you in advance,
> >> Edd
> >> --------------------
> >> Edward Turner
> >>
>
>
