Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana Tue, 05 Nov 2019 23:38:08 -0800

Hi Walter,

> The solr.StopFilter removes all tokens that are stopwords. Those words will
> not be in the index, so they can never match a query.



I think the OP's concern is that the results change when a stopword is
added to the query. I think he is using the filter factory correctly: the
query analysis chain includes the same filter, so "a" should be removed at
query time as well.
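
To make that expectation concrete, here is a rough Python approximation of
the query analyzer from the posted schema. It is only a sketch, not Solr's
implementation: the function name is made up, the stopword set is an assumed
subset of the (elided) stopwords.txt, and position increments are ignored.

```python
import re

# Assumed subset of stopwords.txt; the full list is elided in the thread.
STOPWORDS = frozenset({"a", "an", "and", "are"})


def analyze_query(text, stopwords=STOPWORDS, min_len=2, max_len=20):
    """Approximate the posted query analyzer chain for text_field."""
    # PatternTokenizerFactory: split on anything outside [a-zA-Z0-9/._:]
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # PatternReplaceFilterFactory x3: trim leading/trailing [/._:], '_' -> ' '
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [re.sub(r"[_]", " ", t) for t in tokens]
    # LengthFilterFactory min=2 max=20
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    # LowerCaseFilterFactory
    tokens = [t.lower() for t in tokens]
    # StopFilterFactory (ignoreCase has no extra effect after lowercasing)
    return [t for t in tokens if t not in stopwords]


with_a = analyze_query("lymphoid and a non-lymphoid cell")
without_a = analyze_query("lymphoid and non-lymphoid cell")
print(with_a)
print(without_a)
```

Under these assumptions both queries reduce to the same tokens, so the
field's analysis alone would not explain the difference. That is why the
real Analysis screen output is needed.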

 *@Guilherme*, please post the results for both queries, the document in the
results that you are concerned about, and the full output of the Analysis
screen (for both query and index).

On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:

> No.
>
> The solr.StopFilter removes all tokens that are stopwords. Those words
> will not be in the index, so they can never match a query.
>
> 1. Remove the lines with solr.StopFilter from every analysis chain in
> schema.xml.
> 2. Reload the collection, restart Solr, or whatever to read the new config.
> 3. Reindex all of the documents.
>
> When indexed with the new analysis chain, the stopwords will not be
> removed and they will be searchable.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >
> > Ok. I am kind of lost now.
> > If I open up the console > analysis and perform it, that's the final
> > result.
> >  <Screenshot 2019-11-05 at 14.54.16.png>
> >
> > Your suggestion is: get rid of the <filter stopword.txt> in the
> > schema.xml and, during the index phase, replaceAll("in stopwords.txt"," "),
> > then add to Solr. Is that correct?
> >
> > Thanks David
> >
> >> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com
> <mailto:hastings.recurs...@gmail.com>> wrote:
> >>
> >> Fwd to another server
> >>
> >> no,
> >>               <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> words="stopwords.txt"/>
> >>
> >> is still using stopwords and should be removed, in my opinion of course,
> >> based on your use case may be different, but i generally axe any
> reference
> >> to them at all
> >>
> >> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk
> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>
> >>> Thanks.
> >>> Haven't I done this here ?
> >>>  <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
> >>>           <analyzer type="index">
> >>>               <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>               <filter class="solr.ClassicFilterFactory"/>
> >>>               <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>               <filter class="solr.LowerCaseFilterFactory"/>
> >>>               <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>           </analyzer>
> >>>
> >>>
> >>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com
> <mailto:hastings.recurs...@gmail.com>>
> >>> wrote:
> >>>>
> >>>> Fwd to another server
> >>>>
> >>>> The first thing you should do is remove any reference to stop words
> and
> >>>> never use them, then re-index your data and try it again.
> >>>>
> >>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk
> <mailto:gvit...@ebi.ac.uk>>
> >>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am performing a search to match a name (text_field), however this
> term
> >>>>> contains 'and' and 'a' and it doesn't return any records. If i remove
> >>> 'a'
> >>>>> then it works.
> >>>>> e.g
> >>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>> doesn't work:
> >>>>>
> >>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>
> >>>>> Search term: lymphoid and non-lymphoid cell
> >>>>> works:
> >>>>>
> >>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>> interested in the first result
> >>>>>
> >>>>> schema.xml
> >>>>> <field name="name"                          type="text_field"
> >>>>> indexed="true"  stored="true"   omitNorms="false"   required="true"
> >>>>> multiValued="false"/>
> >>>>>
> >>>>>           <analyzer type="query">
> >>>>>               <tokenizer class="solr.PatternTokenizerFactory"
> >>>>> pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>               <filter class="solr.PatternReplaceFilterFactory"
> >>>>> pattern="^[/._:]+" replacement=""/>
> >>>>>               <filter class="solr.PatternReplaceFilterFactory"
> >>>>> pattern="[/._:]+$" replacement=""/>
> >>>>>               <filter class="solr.PatternReplaceFilterFactory"
> >>>>> pattern="[_]" replacement=" "/>
> >>>>>               <filter class="solr.LengthFilterFactory" min="2"
> >>> max="20"/>
> >>>>>               <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>               <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> >>>>> words="stopwords.txt"/>
> >>>>>           </analyzer>
> >>>>>
> >>>>>       <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
> >>>>>           <analyzer type="index">
> >>>>>               <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>               <filter class="solr.ClassicFilterFactory"/>
> >>>>>               <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>               <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>           </analyzer>
> >>>>>           <analyzer type="query">
> >>>>>               <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>               <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>               <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>               <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>               <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>               <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>           </analyzer>
> >>>>>       </fieldType>
> >>>>>
> >>>>> stopwords.txt
> >>>>> #Standard english stop words taken from Lucene's StopAnalyzer
> >>>>> a
> >>>>> b
> >>>>> c
> >>>>> ....
> >>>>> an
> >>>>> and
> >>>>> are
> >>>>>
> >>>>> Running SolR 6.6.2.
> >>>>>
> >>>>> Is there anything I could do to prevent this ?
> >>>>>
> >>>>> Thanks
> >>>>> Guilherme
> >>>
> >>>
> >
>
>

-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
