Thanks all for the answers, appreciate it! I am happy to contribute. Feel free to assign the ticket to me.
Best,
Guven

On Tue, Nov 9, 2021 at 12:31 PM Eric Pugh <[email protected]> wrote:

> https://issues.apache.org/jira/browse/SOLR-15779
>
> Feel free to weigh in!
>
> > On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C]
> > <[email protected]> wrote:
> >
> > I cannot agree more. On the product provided by www.indexengines.com,
> > we stopped using stopwords when we noted that first names that would be
> > flagged as such by Named Entity Recognition would also be categorized as
> > stopwords in some language. Namely, the key developers Ben and Dan
> > (speaking).
> >
> > On 11/8/21, 10:58 AM, "Markus Jelsma" <[email protected]> wrote:
> >
> > Hello Güven,
> >
> > You should consider not using stopwords at all. The filter is useless or
> > problematic in almost all cases. If you want to avoid trouble, drop the
> > filter, because:
> >
> > * Due to modern compression rates, the memory/disk space the filter
> >   clears up is negligible.
> > * The scoring, tf*idf, gives low scores for high frequency terms.
> > * At some point, a product's name or specification/type/brand will
> >   contain one or more stopwords. This is inevitable!
> >
> > Regards,
> > Markus
> >
> > On Mon, Nov 8, 2021 at 16:31, H. Güven Candoğan <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> We are experimenting with the sample techproducts schema
> >> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
> >> from the Apache Solr master repo.
> >>
> >> We realized that having the stemming filter (PorterStemFilterFactory)
> >> after the stopword filter (StopFilterFactory) seems to create issues.
> >>
> >> For example, we added “what” to the stopword list and we noticed that
> >> for the input “what’s in the box”, we end up with “what box” after
> >> stemming. However, we would want to have only the word “box” at the end
> >> of this process. This desired result “box” can only be achieved when the
> >> stopwords filter is placed after the stemming. Additionally, having the
> >> stopwords filter after lowercasing and stemming seems to create better
> >> stopfilter performance. In the end, we ended up with the following order
> >> in our configuration:
> >>
> >> 1. LowerCaseFilterFactory
> >> 2. PorterStemFilterFactory
> >> 3. StopFilterFactory
> >>
> >> Since we are new to Apache Solr and we are using what seems to be a
> >> “default” configuration, we fear that we might be missing some important
> >> context here. Is there a justification for the default ordering, which I
> >> assume most people will use as-is, and that we might be missing? Do you
> >> see any issues placing the stopwords filter after stemming? Do you see
> >> any issues placing the lowercasing before the stopwords filter and
> >> stemming?
> >>
> >> Regards,
> >> Guven
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <http://www.opensourceconnections.com/> |
> My Free/Busy <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.

--
H. Güven Candoğan
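
For anyone following the thread, the filter-order effect described in the original question can be reproduced with a small toy pipeline. This is only a sketch under stated assumptions: `lowercase`, `toy_stem`, `stop`, and `analyze` are hypothetical stand-ins and not Solr's or Lucene's actual filters, the tokenizer is a plain whitespace split, and `toy_stem` does nothing more than strip a trailing possessive "'s" to mimic how "what's" becomes "what" later in the chain.

```python
# Toy stand-ins for an analysis chain. These are NOT the real Lucene/Solr
# filters; they only illustrate why filter order matters.

STOPWORDS = {"what", "in", "the"}  # assumed stopword list for this example


def lowercase(tokens):
    """Stand-in for a lowercasing filter."""
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    """Stand-in for the stemming step: only strips a trailing possessive 's.

    A real Porter stemmer does far more; this is a deliberate simplification.
    """
    out = []
    for t in tokens:
        for suffix in ("'s", "\u2019s"):  # straight and curly apostrophe
            if t.endswith(suffix):
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out


def stop(tokens):
    """Stand-in for a stopword filter: exact match against STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    """Split on whitespace (an assumption), then apply filters in order."""
    tokens = text.split()
    for f in filters:
        tokens = f(tokens)
    return tokens


# Stopword filter first: "what's" is not in the list, so it survives and is
# only reduced to "what" afterwards.
stop_first = analyze("what's in the box", [stop, lowercase, toy_stem])

# Stopword filter last: "what's" is reduced to "what" first, then removed.
stop_last = analyze("what's in the box", [lowercase, toy_stem, stop])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

The sketch also hints at the trade-off raised in the replies: with the stopword filter last, the stopword list has to match the stemmed forms of the tokens, so entries need to be chosen with the stemmer's output in mind.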
