RE: Solr search engine configuration

Markus Jelsma Tue, 13 Mar 2018 07:45:12 -0700

 
-----Original message-----
> From:PeterKerk <petervdk...@hotmail.com>
> Sent: Tuesday 13th March 2018 14:24
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> Markus, 
> 
> Thanks again. Ok, 1 by 1:
> 
> StemmerOverride wants \t separated fields, that is probably the cause of the
> AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
> proper example listed. I recommend putting a decompounder before a stemmer,
> and have an accent (or ICU) folder as one of the last filters. 
> 
> PVK COMMENT:
> Looking for Decompounders and found a few links, btw a lot of the pages
> these are linked to don't work.
> 
> https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages
> 
> http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
> https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory


Stick to the Javadoc, where the examples are good, or to the reference
guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions
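
For illustration, a hyphenation-based decompounder could be declared roughly as
in the sketch below. The attribute names follow the factory Javadoc linked
above; the file names hyphenation_nl.xml and compounds_nl.txt are placeholders
for files you would have to supply yourself, and the size limits are only
illustrative values, not a recommendation.

```xml
<!-- Sketch only: hyphenation_nl.xml is a hypothetical hyphenation grammar file,
     compounds_nl.txt a word list with one known subword per line. -->
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="hyphenation_nl.xml"
        dictionary="compounds_nl.txt"
        minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
        onlyLongestMatch="false"/>
```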

>               
> my stemdict_nl.txt now contains (words separated by a single tab):
> aachen        aach
> aachener      aachener
> aalmoezen     aalmoes
> beveel        bevool
> dierenzaken   dierenzaak
> 
> The problem before was indeed like @Shawn indicates that I had words in
> there with a space like so:
> dieren zaken  dierenzaak
> 
> 
>       
> About the diff, it looks like KP output, it has the same issues with whether
> or not a word needs double or single vowels in the root. It also shows
> issues with strong verbs/nouns (beveel/bevool). Having this list seems like
> having KP configured, so you should drop it and only list exceptions to KP
> rules in the dict file. This is not easy, so I recommend sticking to your
> domain's vocabulary. 
> 
> PVK COMMENT:
> That's what I now did above right?

Almost: zaken -> zaak is already KP output, so there is no need to list what
the stemmer will do for you.

> 
> 
> Also, unless you have a very specific need for it, drop the StopFilter.
> Nobody these days should want a StopFilter unless they can justify it. We
> use them too, but only for very specific reasons and never for text search.
> You might also want to have a WordDelimiterFilter as your first filter, look
> it up, you probably want to have it. 
> 
> PVK COMMENT:
> But without a StopFilter, won't stopwords be included in searches? I thought
> that for example Google excluded these words in their algorithms?
> 

Yes, stopwords are good! Keep them! And I am glad Google doesn't just strip 
stopwords.

> 
> 
> This is what I have now:
> 
> <fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>             dictionary="compounds_nl.txt"
>             minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
>             onlyLongestMatch="true"/>
>     <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>             dictionary="compounds_nl.txt"
>             minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
>             onlyLongestMatch="true"/>
>     <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>   </analyzer>
> </fieldType>

That looks fine, but you have now omitted the stemmer (Snowball). Put it after 
StemmerOverrideFilter and before ASCIIFolding.
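
As a rough sketch of that ordering, the tail of each analyzer could look like
the fragment below. It assumes the KP stemmer discussed earlier is used through
the Snowball factory with language="Kp"; if you pick a different stemmer, only
that one line changes. The earlier filters are abbreviated here and stay as in
your configuration.

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- WordDelimiter, LowerCase and the decompounder go here, unchanged -->
  <!-- Overrides come first so the tokens they rewrite are protected from the stemmer -->
  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
  <!-- Assumption: Dutch KP stemmer via Snowball -->
  <filter class="solr.SnowballPorterFilterFactory" language="Kp"/>
  <!-- Accent folding stays as one of the last filters -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
```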

> 
>       
> Now for both this query
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true
>       
> and this one:
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
>       
> This result is found: 
> "Hi there dieren zaak something else" 
> 
> And these are NOT: 
> "Hi there dier something else" 
> "Hi there dierenzaak something else" 
> "Hi there dierzaak something else"    

This is because the decompounder doesn't split dierenzaak. You must test this 
in the Solr admin UI before reindexing or querying. Once the decompounder 
splits dierenzaak, and a stemmer is in place, all except 'dier' will be found, 
depending on your mm setting.
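
To make the mm dependency concrete, here is a minimal sketch of edismax
defaults in solrconfig.xml. The handler name and the 100% value are assumptions
for illustration only. With 100%, every query term must match, so a document
containing only 'dier' would not be returned for a two-term query; a lower
value would let it through.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_search_global</str>
    <!-- Illustrative value: require all query terms to match -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```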

And did you reindex? 

> 
> What else do you recommend I try?
>       
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
