Correct order of MappingCharFilter, Tokenizer and GermanStemFilter

Doris Peter Thu, 18 Jul 2019 02:01:54 -0700

Hi, 

another problem with the stemming:


Most of our texts are in German, so we use the GermanStemFilterFactory. But we 
also use MappingCharFilterFactory where we map for example ä->ae.

But of course we want the stemming to turn for example 'häuser' into 'haus', 
which the GermanStemFilterFactory should do, according to the documentation.

At the moment, my configuration looks like this:

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true"
               termVectors="true" termPayloads="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanStemFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
                encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder"/>
        <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt"
                preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
                catenateWords="1" catenateNumbers="1" catenateAll="1"
                generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
                types="wdfftypes.txt"/>
      </analyzer>
    </fieldtype>

So the stem filter is listed before the charFilter.

But the Solr Analyzer says:

MCF 0 h a e u s e r

                WT                      LCF                     GSF            DPTF           WDGF
text            haeuser                 haeuser                 haeu           haeu           haeu
raw_bytes       [68 61 65 75 73 65 72]  [68 61 65 75 73 65 72]  [68 61 65 75]  [68 61 65 75]  [68 61 65 75]
start           0                       0                       0              0              0
end             6                       6                       6              6              6
positionLength  1                       1                       1              1              1
type            word                    word                    word           word           word
termFrequency   1                       1                       1              1              1
position        1                       1                       1              1              1
keyword                                                         false          false          false
payload

(keyword is reported from GSF onward; the payload column appears for DPTF and WDGF but is empty.)

So the MappingCharFilterFactory seems to be executed first, no matter where it 
appears in the configuration?

The Solr documentation also says that it should be placed before the tokenizer:
https://lucene.apache.org/solr/guide/7_6/charfilterfactories.html
"CharFilters can be chained like Token Filters and placed in front of a 
Tokenizer."
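
To make the order that is actually applied visible, the analyzer could be 
written with the charFilter in front of the tokenizer. This is only a sketch 
of the same chain in the order the analysis output above suggests (MCF, WT, 
LCF, GSF, ...); the payload and word-delimiter filters are left out for 
brevity, and it is an assumption that this reordering changes nothing in the 
resulting tokens:

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true"
               termVectors="true" termPayloads="true">
      <analyzer>
        <!-- char filters work on the raw character stream, before tokenization -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- token filters then run in the order they are listed -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanStemFilterFactory"/>
        <!-- remaining filters (DelimitedPayloadTokenFilterFactory,
             WordDelimiterGraphFilterFactory) unchanged -->
      </analyzer>
    </fieldtype>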

But if the word häuser is changed to haeuser before stemming, the stemmer no 
longer reduces it to 'haus' (it produces 'haeu' instead) :-/

Is there a way to solve this problem?
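
One direction that might work, as an untested sketch: drop the charFilter and 
fold the remaining non-ASCII characters with a token filter placed after the 
stemmer, so that GermanStemFilterFactory still sees 'häuser'. The sketch 
assumes solr.ASCIIFoldingFilterFactory would be an acceptable substitute for 
the mapping file; it folds ä to a rather than ae, so it is not equivalent to 
the current mapping. The payload and word-delimiter filters are again 
omitted:

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true"
               termVectors="true" termPayloads="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- the stemmer receives the token with its umlaut intact -->
        <filter class="solr.GermanStemFilterFactory"/>
        <!-- assumption: folding as a token filter, applied after stemming -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldtype>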

Thanks a lot, Doris
