Thanks for your reply, Scott.
I tried
bs.language=de&bs.country=de
Unfortunately the problem still occurs.
I have just discovered that the problem does not only affect "ß" but
also "æ" (which is mapped to "ae"
at query and index time)
q=hae --> <em>hæna</em>
So it seems to me that the problem affects any single character
that is mapped to several characters by <charFilter
class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
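The observation above can be illustrated with a toy model in Python. This is not Solr or Lucene code, and every name in it (MAPPING, map_chars, edge_gram_span, lengths) is hypothetical. It only shows that once "ß" or "æ" is expanded, the analyzed text is longer than the original, so a gram length counted in mapped characters no longer equals a length in the original text. Whether this mismatch is what actually makes the highlighter fall back to the whole word is an assumption, not something confirmed in this thread.

```python
# Toy model only: NOT the Solr/Lucene implementation.
# Two mappings taken from this thread; the real mapping file has many more.
MAPPING = {"ß": "ss", "æ": "ae"}


def map_chars(text):
    """Expand mapped characters.

    Returns (mapped_text, source_offsets), where source_offsets[i] is the
    index in the original text that produced mapped character i.
    """
    out, src = [], []
    for i, ch in enumerate(text):
        for mapped_ch in MAPPING.get(ch, ch):
            out.append(mapped_ch)
            src.append(i)
    return "".join(out), src


def lengths(text):
    """Return (original_length, mapped_length) for a piece of text."""
    mapped, _ = map_chars(text)
    return len(text), len(mapped)


def edge_gram_span(text, gram_len):
    """Span [start, end) of the original text covered by the first
    gram_len characters of the mapped text (an edge n-gram)."""
    _, src = map_chars(text)
    return 0, src[gram_len - 1] + 1


if __name__ == "__main__":
    for word in ("Rosensteinpark", "Rosenstraße", "hæna"):
        print(word, lengths(word))
    # The 3-character gram "hae" corresponds to only 2 original characters.
    print(edge_gram_span("hæna", 3))
```

In this model "Rosensteinpark" keeps the same length before and after mapping, while "Rosenstraße" and "hæna" each grow by one character, which matches the set of words for which the highlighting goes wrong.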
Jérôme
On 13/10/2015 07:46, Scott Stults wrote:
My guess is that the boundary scanner isn't configured right for your
highlighter. Try setting the bs.language and bs.country parameters either
in your request or in the requestHandler.
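As a rough sketch of what passing these parameters in the request could look like, here is a small Python snippet that builds a query string. It rests on several assumptions: that the FastVectorHighlighter is in use, that a boundary scanner named "breakIterator" is declared in solrconfig.xml, and that the parameters take the hl.-prefixed names (hl.bs.language, hl.bs.country) rather than the unprefixed forms written above. The field name textng and the query rosen come from the original message below; everything else is illustrative.

```python
from urllib.parse import urlencode

# Assumed parameter names and values; adjust to the actual handler setup.
params = {
    "q": "rosen",
    "hl": "true",
    "hl.fl": "textng",                      # field being highlighted
    "hl.useFastVectorHighlighter": "true",  # assumption: FVH is in use
    "hl.boundaryScanner": "breakIterator",  # assumption: scanner name in solrconfig.xml
    "hl.bs.language": "de",
    "hl.bs.country": "DE",
}

# Query string to append to the select handler URL.
query_string = urlencode(params)

if __name__ == "__main__":
    print(query_string)
```

The same key/value pairs could instead be placed in the defaults of the requestHandler, which avoids repeating them on every request.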
k/r,
Scott
On Mon, Oct 5, 2015 at 4:57 AM, Jérôme Bernardes <jerome.bernar...@mappy.com>
wrote:
Dear Solr Users,
I am facing a problem with highlighting on ngram fields.
Highlighting works well, except for words containing the German character
"ß".
E.g., with q=rosen&
"highlighting": {
"gcl3r:12723710:6643": {
"textng": [
"<em>Rosen</em>steinpark (Métro), Stuttgart (Allemagne)"
]
},
"gcl3r:2267495:780930": {
"textng": [
"<em>Rosenstraße</em>, 94554 Moos (Allemagne)"
]
}
}
Without "ß", words are highlighted partially (<em>Rosen</em>steinpark), but
with "ß", the whole word is highlighted (<em>Rosenstraße</em>).
-------------
The character "ß" is mapped to "ss" at query and index time (using
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>).
Here is the schema.xml definition for the highlighted field.
<fieldType name="autocomplete_ngram" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
<tokenizer class="solr.PatternTokenizerFactory"
pattern="[\s,;:
\-\']"/>
<filter class="solr.WordDelimiterFilterFactory"
splitOnNumerics="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1"
preserveOriginal="1"
types="wdfftypes.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonym.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20"
minGramSize="1"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d
\*&æøåÆØÅ ])" replacement="" replace="all"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
<!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
<tokenizer class="solr.PatternTokenizerFactory"
pattern="[\s,;:
\-\']"/>
<filter class="solr.WordDelimiterFilterFactory"
splitOnNumerics="0"
generateWordParts="1"
generateNumberParts="0"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"
types="wdfftypes.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d
\*&æøåÆØÅ ])" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
</analyzer>
</fieldType>
Is this a problem in our configuration, or a known bug?
Regards
Jérôme