Markus, 

Thanks again. OK, one by one:

StemmerOverride wants tab-separated (\t) fields; that is probably the cause
of the AIooBE (ArrayIndexOutOfBoundsException) you get. Regarding schema
definitions, each factory JavaDoc [1] has a proper example listed. I
recommend putting a decompounder before a stemmer, and having an accent (or
ICU) folding filter as one of the last filters.

PVK COMMENT:
I looked for decompounders and found a few links. By the way, many of the
pages these link to no longer work.

https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages
http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory

My stemdict_nl.txt now contains (word and stem separated by a single tab):
aachen  aach
aachener        aachener
aalmoezen       aalmoes
beveel  bevool
dierenzaken     dierenzaak

The problem before was indeed, as @Shawn indicated, that I had entries in
there whose word contained a space, like so:
dieren zaken    dierenzaak
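
To catch this kind of entry before reloading the core, a small check along these lines might help. It is only a sketch: the function name `check_stemdict_lines` and the sample lines are made up for illustration, and it assumes that blank lines and lines starting with `#` are ignored by Solr when it loads the file. It flags lines that do not have exactly one tab, and keys containing a space (which a whitespace-tokenized field can never produce as a single token).

```python
def check_stemdict_lines(lines):
    """Return (line_number, problem) pairs for suspicious dictionary entries."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Assumption: blank lines and '#' comment lines are skipped by the loader.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, "expected exactly one tab, found %d" % (len(fields) - 1))
            )
            continue
        key, stem = fields
        if not key or not stem:
            problems.append((number, "empty key or stem"))
        elif " " in key:
            problems.append((number, "key contains a space"))
    return problems


# Hypothetical sample entries, not the real stemdict_nl.txt.
sample = [
    "aachen\taach",
    "dierenzaken\tdierenzaak",
    "dieren zaken\tdierenzaak",  # key with a space
    "aalmoezen aalmoes",         # spaces instead of a tab
]

for number, problem in check_stemdict_lines(sample):
    print(number, problem)
```

For the real file, the same function could be fed `open("stemdict_nl.txt", encoding="utf-8")`, assuming the file is UTF-8.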


        
About the diff: it looks like KP output. It has the same issues with whether
a word needs double or single vowels in the root, and it also shows issues
with strong verbs/nouns (beveel/bevool). Having this list amounts to having
KP configured, so you should drop it and list only the exceptions to KP's
rules in the dict file. This is not easy, so I recommend staying within your
domain's vocabulary.

PVK COMMENT:
That's what I have now done above, right?


Also, unless you have a very specific need for it, drop the StopFilter.
Nobody these days should want a StopFilter unless they can justify it. We
use them too, but only for very specific reasons and never for text search.
You might also want a WordDelimiterFilter as your first filter; look it up,
you probably want it.

PVK COMMENT:
But without a StopFilter, won't stopwords be included in searches? I thought
that Google, for example, excluded these words in its algorithms?




This is what I have now:

<fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.StemmerOverrideFilterFactory"
            dictionary="stemdict_nl.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.StemmerOverrideFilterFactory"
            dictionary="stemdict_nl.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
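
To reason about what the decompounder settings in this field type do to a token such as "dierenzaak", a rough model of dictionary-based decompounding may be useful. This is a simplified sketch and not Lucene's actual code (newer Lucene versions may treat onlyLongestMatch differently); the function name `decompound` and the dictionary contents are assumptions, since the real compounds_nl.txt is not shown here. It keeps the original token and, per start position, adds the longest dictionary word found.

```python
def decompound(token, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=True):
    """Simplified model of dictionary decompounding (not Lucene's real code)."""
    tokens = [token]  # the original token is always kept
    if len(token) < min_word_size:
        return tokens  # too short to be considered a compound
    for start in range(len(token) - min_subword_size + 1):
        matches = []
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(token):
                break
            part = token[start:start + length]
            if part in dictionary:
                matches.append(part)
        if only_longest_match:
            # matches are in increasing length order, so the last is the longest
            matches = matches[-1:]
        tokens.extend(matches)
    return tokens


# Hypothetical dictionary entries; the real compounds_nl.txt may differ.
words = {"dier", "dieren", "zaak"}
print(decompound("dierenzaak", words))
print(decompound("dierenzaak", set()))
```

In this model "dierenzaak" only yields the subwords "dieren" and "zaak" when both are present in the dictionary; with an empty dictionary the token passes through unchanged.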

        
Now for both this query
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true
        
and this one:
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
        
This result is found: 
"Hi there dieren zaak something else" 

And these are NOT: 
"Hi there dier something else" 
"Hi there dierenzaak something else" 
"Hi there dierzaak something else"      

What else do you recommend I try?
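
For reference, the two request URLs above can be compared parameter by parameter with a few lines of standard-library Python. This is just a sketch; the helper names `query_params` and `param_diff` are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Decode the query string of a URL into a dict of value lists."""
    return parse_qs(urlsplit(url).query)


def param_diff(url_a, url_b):
    """Return the parameters whose values differ between the two URLs."""
    a, b = query_params(url_a), query_params(url_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


base = "http://localhost:8983/solr/tt-search-global/select"
first = (base + "?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml"
         "&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true")
second = (base + "?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml"
          "&indent=true&defType=edismax&qf=title_search_global"
          "&stopwords=true&lowercaseOperators=true")

print(query_params(first)["q"])
print(param_diff(first, second))
```

The only difference between the two requests is the extra qf parameter in the second one; the decoded q value is the same in both.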
        



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
