-----Original message-----
> From:PeterKerk <petervdk...@hotmail.com>
> Sent: Tuesday 13th March 2018 14:24
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
>
> Markus,
>
> Thanks again. Ok, 1 by 1:
>
> StemmerOverride wants \t separated fields, that is probably the cause of the
> AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
> proper example listed. I recommend putting a decompounder before a stemmer,
> and have an accent (or ICU) folder as one of the last filters.
>
> PVK COMMENT:
> Looking for Decompounders and found a few links, btw a lot of the pages
> these are linked to don't work.
>
> https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages
>
> http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
>
> https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory
You must stay in the Javadoc section, there the examples are good, or the reference guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

>
> my stemdict_nl.txt now contains (words separated by a single tab):
> aachen aach
> aachener aachener
> aalmoezen aalmoes
> beveel bevool
> dierenzaken dierenzaak
>
> The problem before was indeed like @Shawn indicates that I had words in
> there with a space like so:
> dieren zaken dierenzaak
>
>
>
> About the diff, it looks like KP output, it has the same issues with whether
> or not a word needs double or single vowels in the root. It also shows
> issues with strong verbs/nouns (beveel/bevool). Having this list seems like
> having KP configured so you should drop it, and only list exceptions to KP
> rules in the dict file. This is not easy, so I recommend sticking to your
> domain's vocabulary.
>
> PVK COMMENT:
> That's what I now did above right?

Almost, zaken -> zaak is already KP output, no need to input what the stemmer will do for you.

>
>
> Also, unless you have a very specific need for it, drop the StopFilter.
> Nobody these days should want a StopFilter unless they can justify it. We
> use them too, but only for very specific reasons, and never for text search.
> You might also want to have a WordDelimiterFilter as your first filter, look
> it up, you probably want to have it.
>
> PVK COMMENT:
> But without a StopFilter, won't stopwords be included in searches? I thought
> that for example Google excluded these words in their algorithms?
>

Yes, stopwords are good! Keep them! And I am glad Google doesn't just strip stopwords.
>
>
> This is what I have now:
>
> <fieldType name="searchtext_nl" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>             dictionary="compounds_nl.txt"
>             minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
>             onlyLongestMatch="true"/>
>
>     <filter class="solr.StemmerOverrideFilterFactory"
>             dictionary="stemdict_nl.txt"/>
>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>
>     <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
>             dictionary="compounds_nl.txt"
>             minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
>             onlyLongestMatch="true"/>
>
>     <filter class="solr.StemmerOverrideFilterFactory"
>             dictionary="stemdict_nl.txt"/>
>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>
>   </analyzer>
> </fieldType>

That looks fine, but you have now omitted the stemmer (Snowball). Put it after StemmerOverrideFilter, and before ASCIIFolding.
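
For illustration only, here is a rough sketch of the filter order described above. It assumes that "KP" in this thread means the Kraaij-Pohlmann stemmer and that it is selected with language="Kp" on SnowballPorterFilterFactory; if that assumption is wrong for your Solr version, language="Dutch" would be the alternative to try. The dictionary file names are the ones from the quoted config, and the single untyped <analyzer> is a simplification that works here only because the index and query chains are identical.

```xml
<fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer without a type attribute is used for both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Decompounder goes before any stemming. -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <!-- Overrides first: tokens matched here are marked as keywords,
         so the stemmer that follows leaves them untouched. -->
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
    <!-- Assumption: language="Kp" selects the Kraaij-Pohlmann stemmer. -->
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"/>
    <!-- Accent folding as one of the last filters. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this order, stemdict_nl.txt only needs the exceptions to the stemmer's rules; everything else falls through to the stemmer.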
>
>
> Now for both this query
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true
>
> and this one:
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
>
> This result is found:
> "Hi there dieren zaak something else"
>
> And these are NOT:
> "Hi there dier something else"
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"

This is because the decompounder doesn't split dierenzaak; you must test this in the Solr admin UI before reindexing or trying. Once the decompounder splits dierenzaak, and a stemmer is in place, all except 'dier' will be found, depending on your mm-setting. And did you reindex?

>
> What else do you recommend I try?
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>