Removing stopwords is a dumb requirement. “Doctor, it hurts when I shove hedgehogs up my arse.”
Part of our job as search engineers is to solve the real problem, not implement a pile of requirements from people who don’t understand how search works.

Here is an article I wrote 13 years ago about why we didn’t remove stopwords at Netflix.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 30, 2020, at 8:56 AM, Permakoff, Vadim <vadim.permak...@verisk.com> wrote:
>
> Hi Erik,
> That's what I did in the past, but this is an enterprise search and I have a
> requirement to remove the stopwords.
> To have both features I can add synonyms in the front-end application, I know
> it will work, but I need a justification why I have to do it in the
> application as it is an additional effort.
> I thought there is a bug for such case to which I can refer, because
> according to documentation it should work, right?
> Anyway, there is more to it. If I'll add the same synonym processing to the
> indexing part, i.e.
> the configuration will be like this:
>
> <fieldType name="text_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> The analysis shows the parsing is matching now for indexing and querying
> path, but the exact match result still cannot be found! This is weird.
> Any thoughts?
>
> Best Regards,
> Vadim Permakoff
>
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Monday, June 29, 2020 10:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
>
> Looks like you’re removing stopwords. Stopwords cause issues like this with
> the positions being off.
>
> It’s becoming more and more common to _NOT_ remove stopwords, is that an
> option?
>
> Best,
> Erick
>
>> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim <vadim.permak...@verisk.com> wrote:
>>
>> Hi Shawn,
>> Many thanks for the response, I checked the field and it is correct. Let's
>> call it _text_ to make it easier.
>> I believe the parsing is also correct, please see below:
>> - Query without quotes (works):
>>     "querystring":"expand the methods",
>>     "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) _text_:methods",
>>
>> - Query with quotes (does not work):
>>     "querystring":"\"expand the methods\"",
>>     "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
>>
>> The document has text:
>> "to expand the methods for mailing cancellation"
>>
>> The analysis on this field shows that all words are present in the index and
>> the query, the order is also correct, but the word "methods" is moved one
>> position, I guess that's why the result is not found.
>>
>> Best Regards,
>> Vadim Permakoff
>>
>>
>> -----Original Message-----
>> From: Shawn Heisey <apa...@elyograg.org>
>> Sent: Monday, June 29, 2020 6:28 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query in quotes cannot find results
>>
>> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>>> The basic query q=expand the methods <<< finds the document,
>>> the query (in quotes) q="expand the methods" <<< cannot find the document
>>>
>>> Am I doing something wrong, or is it a known bug (I saw similar issues
>>> discussed in the past, but not for exact match query) and if yes - what is
>>> the Jira for it?
>>
>> The most helpful information will come from running both queries with debug
>> enabled, so you can see how the query is parsed. If you add a parameter
>> "debugQuery=true" to the URL, then the response should include the parsed
>> query. Compare those, and see if you can tell what the differences are.
>>
>> One of the most common problems for queries like this is that you're not
>> searching the field that you THINK you're searching. I don't know whether
>> this is the problem, I just mention it because it is a common error.
>>
>> Thanks,
>> Shawn
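
For anyone following the position discussion in this thread, here is a rough Python sketch of why the quoted query can miss. It is only a toy model of token positions, not Lucene or Solr code. The stopword set below is an assumption about what stopwords.txt contains, the helper names analyze and matches_at_offsets are made up for illustration, and the synonym branch ("blow up") is left out. It treats the parsed spanNear(..., 0, true) from the debug output as "terms must sit at adjacent positions", which is a simplification.

```python
# Toy model of token positions; this is NOT Lucene/Solr code.
# Assumption: stopwords.txt contains at least these words.
STOPWORDS = {"to", "the", "for"}


def analyze(text, stopwords=STOPWORDS):
    """Return (term, position) pairs; removed stopwords leave a position gap."""
    tokens = []
    for position, term in enumerate(text.lower().split()):
        if term not in stopwords:
            tokens.append((term, position))
    return tokens


def matches_at_offsets(doc_tokens, terms, offsets):
    """True if every term occurs at start + offset for some start position."""
    positions = {}
    for term, pos in doc_tokens:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + off in positions.get(term, set())
            for term, off in zip(terms, offsets))
        for start in positions.get(terms[0], set())
    )


doc = analyze("to expand the methods for mailing cancellation")
query = analyze("expand the methods")
terms = [term for term, _ in query]

# Offsets that keep the hole left by the removed stopword "the".
gap_offsets = [pos - query[0][1] for _, pos in query]
# Offsets for a slop-0, in-order match that ignores the hole.
adjacent_offsets = list(range(len(terms)))

print(doc)
print("gap-aware match:", matches_at_offsets(doc, terms, gap_offsets))
print("adjacent match: ", matches_at_offsets(doc, terms, adjacent_offsets))
```

In this model the gap-aware comparison finds the document and the adjacent-only comparison does not, which lines up with the observation above that "methods" sits one position further along in the index than the quoted query expects.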