Re: Changing the order of stemming and stopwords cleaning in techproducts config

H. Güven Candoğan Wed, 10 Nov 2021 02:04:12 -0800

Thanks all for the answers, appreciate it!

I am happy to contribute. Feel free to assign the ticket to me.


Best
Guven

On Tue, Nov 9, 2021 at 12:31 PM Eric Pugh <[email protected]>
wrote:

> https://issues.apache.org/jira/browse/SOLR-15779
>
> Feel free to weigh in!
>
> > On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C]
> <[email protected]> wrote:
> >
> > I could not agree more. On the product provided by www.indexengines.com,
> > we stopped using stopwords when we noticed that first names, which Named
> > Entity Recognition would flag as names, were also categorized as
> > stopwords in some language. Namely, those of the key developers Ben and
> > Dan (speaking).
> >
> > On 11/8/21, 10:58 AM, "Markus Jelsma" <[email protected]>
> wrote:
> >
> >    Hello Güven,
> >
> >    You should consider not using stopwords at all. The filter is useless
> >    or problematic in almost all cases. If you want to avoid trouble, drop
> >    the filter, because:
> >
> >    * Due to modern compression rates, the memory/disk space the filter
> >    clears up is negligible.
> >    * The scoring, tf*idf, gives low scores for high frequency terms.
> >    * At some point, a product's name or specification/type/brand will
> >    contain one or more stopwords. This is inevitable!
> >
> >    Regards,
> >    Markus
> >
> >    On Mon, 8 Nov 2021 at 16:31, H. Güven Candoğan <[email protected]>
> >    wrote:
> >
> >> Hi all,
> >>
> >> We are experimenting with the sample techproducts schema
> >> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
> >> from the Apache Solr master repo.
> >>
> >> We realized that having the stemming filter (PorterStemFilterFactory)
> >> after the stopword filter (StopFilterFactory) seems to create issues.
> >>
> >> For example, we added “what” to the stopword list and noticed that for
> >> the input “what’s in the box” we end up with “what box” after stemming.
> >> However, we would want only the word “box” to remain at the end of this
> >> process. This desired result, “box”, can only be achieved when the
> >> stopwords filter is placed after stemming. Additionally, having the
> >> stopwords filter after lowercasing and stemming seems to improve stop
> >> filter performance. In the end, we settled on the following order in
> >> our configuration:
> >>
> >>
> >>   1. LowerCaseFilterFactory
> >>   2. PorterStemFilterFactory
> >>   3. StopFilterFactory
> >>
> >>
> >> Since we are new to Apache Solr and are using what seems to be a
> >> “default” configuration, we fear that we might be missing some important
> >> context here. Is there a justification for the default ordering, which I
> >> assume most people will use as-is, that we might be missing? Do you see
> >> any issues with placing the stopwords filter after stemming? Do you see
> >> any issues with placing lowercasing before the stopwords filter and
> >> stemming?
> >>
> >> Regards,
> >> Guven
> >>
> >
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
>
>

-- 
H. Güven Candoğan
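
The ordering effect described in the original question can be reproduced with a toy analysis chain. This is only a sketch and not Solr code: the whitespace tokenizer, the three-word stopword set, and the `toy_stem` function (which merely strips a trailing possessive) are assumptions standing in for the real tokenizer, stopword file, and stemming step.

```python
# Toy analysis chain, NOT Solr code. All names below are hypothetical.

STOPWORDS = {"what", "in", "the"}  # assumed stopword list for this example


def tokenize(text):
    """Split on whitespace after normalising the typographic apostrophe."""
    return text.replace("\u2019", "'").split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    """Stand-in for a stemmer: strips a trailing 's and nothing else."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    """Run the tokens through each filter in the given order."""
    tokens = tokenize(text)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens


# Stop filter first: "what's" is not in the stopword set, so it survives
# and is only reduced to "what" afterwards.
stop_then_stem = analyze("What's in the box", [lowercase, remove_stopwords, toy_stem])

# Stemming first: "what's" becomes "what" before the stop filter sees it.
stem_then_stop = analyze("What's in the box", [lowercase, toy_stem, remove_stopwords])

print(stop_then_stem)  # ['what', 'box']
print(stem_then_stop)  # ['box']
```

One consequence of running the stop filter last is that it compares against stemmed tokens, so the stopword list itself would need to hold stemmed forms for every entry to keep matching.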
