Hi, another problem with the stemming:
Most of our texts are in German, so we use the GermanStemFilterFactory. But we also use the MappingCharFilterFactory, where we map, for example, ä->ae. Of course we still want the stemming to turn, for example, 'häuser' into 'haus', which the GermanStemFilterFactory should do according to the documentation.

At the moment, my configuration looks like this:

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true" termVectors="true" termPayloads="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanStemFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑" encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
                protected="protectedword.txt"
                preserveOriginal="0"
                splitOnNumerics="1"
                splitOnCaseChange="0"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="1"
                generateWordParts="1"
                generateNumberParts="1"
                stemEnglishPossessive="1"
                types="wdfftypes.txt" />
      </analyzer>
    </fieldtype>

So the stemmer comes before the CharFilter. But the Solr Analyzer says:

    MCF   0 h a e u s e r

    WT    text     raw_bytes               start  end  positionLength  type  termFrequency  position
          haeuser  [68 61 65 75 73 65 72]  0      6    1               word  1              1

    LCF   text     raw_bytes               start  end  positionLength  type  termFrequency  position
          haeuser  [68 61 65 75 73 65 72]  0      6    1               word  1              1

    GSF   text     raw_bytes               start  end  positionLength  type  termFrequency  position  keyword
          haeu     [68 61 65 75]           0      6    1               word  1              1         false

    DPTF  text     raw_bytes               start  end  positionLength  type  termFrequency  position  keyword  payload
          haeu     [68 61 65 75]           0      6    1               word  1              1         false

    WDGF  text     raw_bytes               start  end  positionLength  type  termFrequency  position  keyword  payload
          haeu     [68 61 65 75]           0      6    1               word  1              1         false

So the MappingCharFilter seems to be executed first, no matter which position it has in the configuration?
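
To illustrate what I think is happening: judging from the analyzer output, the chain seems to run as if the analyzer had been written in the order below, with the charFilter ahead of the tokenizer. This is only a sketch of my reading of the output, not a verified description of what Solr does internally. All class names and attributes are copied unchanged from my configuration; only the position of the charFilter element differs.

```xml
<analyzer>
  <!-- Assumption: the charFilter is applied to the raw text before tokenizing,
       regardless of where it appears inside <analyzer> -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- The stemmer therefore receives "haeuser", not "häuser" -->
  <filter class="solr.GermanStemFilterFactory"/>
  <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑" encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
  <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt" preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0" catenateWords="1" catenateNumbers="1" catenateAll="1" generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1" types="wdfftypes.txt" />
</analyzer>
```

Read this way, the order matches the stages in the output (MCF, WT, LCF, GSF, DPTF, WDGF), and it would explain why the GSF row shows haeu.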

The Solr documentation also says that a CharFilter should be placed before the tokenizer:

https://lucene.apache.org/solr/guide/7_6/charfilterfactories.html
"CharFilters can be chained like Token Filters and placed in front of a Tokenizer."

But if the word häuser is changed to haeuser, the stemmer no longer stems it correctly: it produces haeu instead of haus :-/

Is there a way to solve this problem?

Thanks a lot,
Doris