You should stick to the Javadoc, where the examples are good, or the
reference guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

PVK COMMENT 1:
This seems to be for Solr 6.5+? I'm using 4.3.1, and an upgrade is not on
the radar any time soon. Will using DictionaryCompoundWordTokenFilterFactory,
as I'm doing now, severely degrade my result quality compared to
HyphenationCompoundWordTokenFilterFactory?


Almost. zaken -> zaak is already KP output, so there is no need to enter
what the stemmer will already do for you.

PVK COMMENT 2:
How do you know zaken -> zaak is already KP output? Is there a list
somewhere?
        
PVK COMMENT 3: 
I now have:

<fieldType name="searchtext_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"
            protected="protwords_nl.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compounds_nl.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"
            protected="protwords_nl.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

I tested in admin UI (and yes, I restart Solr and reindex every time I make
a change):      
        
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
returns:
"hi there dieren zaak something else"
"hi there dier something else"

http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
returns:
"hi there dierenzaak something else"

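For anyone who wants to reproduce these two requests outside the admin UI,
here is a small Python sketch that builds the same query strings. The helper
name select_url is my own invention; the host, core name, and parameters are
just the ones from the URLs in this post.

```python
from urllib.parse import urlencode

# Base URL taken from the requests in this post.
BASE = "http://localhost:8983/solr/tt-search-global/select"


def select_url(q, **extra):
    """Build a /select URL with the same fl/wt/indent parameters used in this post.

    select_url is a hypothetical helper, not part of Solr or any client library.
    """
    params = {"q": q, "fl": "id,title", "wt": "xml", "indent": "true"}
    params.update(extra)
    # Keep parentheses unescaped so the result reads like the admin UI URL.
    return BASE + "?" + urlencode(params, safe="()")


first = select_url("title_search_global:(dieren zaak)")
second = select_url(
    "title_search_global:(dierenzaak)",
    defType="edismax",
    qf="title_search_global",
    stopwords="true",
    lowercaseOperators="true",
)
print(first)
print(second)
```

The two printed URLs should be the same as the two requests shown in this
post, so they can be pasted into a browser or passed to curl.
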
So I added "dieren" to compounds_nl.txt.

Now a query on "title_search_global:(dieren zaak)" returns:
<doc>
    <str name="title">hi there dieren zaak something else</str>
    <str name="id">115_3699638</str>
</doc>
<doc>
    <str name="title">hi there dier something else</str>
    <str name="id">115_3699637</str>
</doc>
<doc>
    <str name="title">hi there dierenzaak something else</str>
    <str name="id">115_3699639</str>
</doc>

So it's starting to look good! :-)

What I'd like to know: how can I make Solr rank "dierenzaak" higher than
just "dier" in the results above?

Also, I'm still not 100% sure what adding "dieren" to compounds_nl.txt
actually does. I assume DictionaryCompoundWordTokenFilterFactory just looks
for that exact string inside a token and, if it finds it, treats it as a
separate word. Is that correct?
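To make that assumption concrete, this is roughly how I picture the filter
working, written as a small Python sketch. It is only my mental model, not
the actual Lucene code: the function name decompound is made up, the
dictionary contents are hypothetical (I'm pretending compounds_nl.txt
contains "dieren" and "zaak"), and the parameter defaults simply mirror the
attributes in my fieldType.

```python
def decompound(token, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=True):
    """Return the original token followed by dictionary words found inside it.

    Assumed behaviour (a guess, not verified against Lucene): every
    substring between min_subword_size and max_subword_size characters is
    looked up in the dictionary; with only_longest_match, only the longest
    hit per start position is kept.
    """
    out = [token]  # assumption: the original token is always kept
    if len(token) < min_word_size:
        return out  # token too short to be split at all
    for start in range(len(token) - min_subword_size + 1):
        matches = []
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            part = token[start:start + size]
            if part in dictionary:
                matches.append(part)  # sizes ascend, so the last hit is longest
        if matches:
            out.extend([matches[-1]] if only_longest_match else matches)
    return out


# Hypothetical dictionary contents, for illustration only.
words = {"dieren", "zaak"}
compound_tokens = decompound("dierenzaak", words)
short_tokens = decompound("dier", words)
print(compound_tokens)  # ['dierenzaak', 'dieren', 'zaak']
print(short_tokens)     # ['dier']
```

If this picture is right, it would explain why "dierenzaak" started matching
a search for "dieren zaak" once "dieren" was in the dictionary, but I'd
appreciate confirmation.
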

Thanks again!


