Hi,
I have another problem with stemming:
Most of our texts are in German, so we use the GermanStemFilterFactory. We
also use the MappingCharFilterFactory to map, for example, ä -> ae.
Of course, we still want the stemmer to turn, for example, 'häuser' into 'haus',
which the GermanStemFilterFactory should do according to the documentation.
At the moment, my configuration looks like this:
<fieldtype name="text_ocr" class="solr.TextField" termPositions="true"
           termVectors="true" termPayloads="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanStemFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-FoldToASCII.txt"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
            encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            protected="protectedword.txt"
            preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="1"
            stemEnglishPossessive="1"
            types="wdfftypes.txt"/>
  </analyzer>
</fieldtype>
So the stemmer is declared before the CharFilter.
But the Solr analysis screen shows:
MCF 0 h a e u s e r
                WT                      LCF                     GSF            DPTF           WDGF
text            haeuser                 haeuser                 haeu           haeu           haeu
raw_bytes       [68 61 65 75 73 65 72]  [68 61 65 75 73 65 72]  [68 61 65 75]  [68 61 65 75]  [68 61 65 75]
start           0                       0                       0              0              0
end             6                       6                       6              6              6
positionLength  1                       1                       1              1              1
type            word                    word                    word           word           word
termFrequency   1                       1                       1              1              1
position        1                       1                       1              1              1
keyword         -                       -                       false          false          false
payload         -                       -                       -              (empty)        (empty)
So the MappingCharFilter seems to be executed first, no matter where it is
placed in the configuration?
The Solr documentation also says that it should be placed before the Tokenizer:
https://lucene.apache.org/solr/guide/7_6/charfilterfactories.html
"CharFilters can be chained like Token Filters and placed in front of a
Tokenizer."
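If that is right, my analyzer presumably behaves as if the charFilter were
declared first. Here is a sketch of that effective order, assuming Solr always
applies charFilters before the tokenizer regardless of where they appear inside
<analyzer>. The elements and attributes are the same as in my configuration;
only the position of the charFilter differs:

```xml
<fieldtype name="text_ocr" class="solr.TextField" termPositions="true"
           termVectors="true" termPayloads="true">
  <analyzer>
    <!-- Assumption: char filters run on the raw character stream,
         so ä -> ae happens before any token exists -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-FoldToASCII.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The stemmer therefore already sees 'haeuser', not 'häuser' -->
    <filter class="solr.GermanStemFilterFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
            encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            protected="protectedword.txt"
            preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="1"
            stemEnglishPossessive="1"
            types="wdfftypes.txt"/>
  </analyzer>
</fieldtype>
```

This would match the analysis output, where MCF is listed before WT and the
tokenizer already receives 'haeuser'.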
But once the word 'häuser' has been changed to 'haeuser', the stemmer no longer
stems it correctly: it produces 'haeu' instead of 'haus' :-/
Is there a way to solve this problem?
Thanks a lot, Doris