Problem with solr suggester in case of non-ASCII characters

Szűcs Roland Tue, 30 Jul 2019 06:50:32 -0700

Hi All,

I have an author suggester (a search component and the related request
handler) defined in solrconfig.xml:
<searchComponent name="suggest" class="solr.SuggestComponent">
    <!-- All suggester components must have different file paths to avoid
    write lock issues -->
    <lst name="suggester">
      <str name="name">author</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">BOOK_productAuthor</str>
      <str name="suggestAnalyzerFieldType">short_text_hu</str>
      <str name="indexPath">suggester_infix_author</str>
      <str name="buildOnStartup">false</str>
      <str name="buildOnCommit">false</str>
      <str name="minPrefixChars">2</str>
    </lst>
</searchComponent>


<requestHandler name="/suggesthandler" class="solr.SearchHandler"
startup="lazy" >
<lst name="defaults">
  <str name="suggest">true</str>
  <str name="suggest.count">10</str>
  <str name="suggest.dictionary">author</str>
</lst>
<arr name="components">
  <str>suggest</str>
</arr>
</requestHandler>

The author field gets only minimal text processing at query and index time,
based on the following definition:
<fieldType name="short_text_hu" class="solr.TextField"
positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"
docValues="true"/>
  <fieldType name="strings" class="solr.StrField" sortMissingLast="true"
docValues="true" multiValued="true"/>
  <fieldType name="text_ar" class="solr.TextField"
positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt"
ignoreCase="true"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
  </fieldType>

When I use queries with only ASCII characters, the results are correct:
"Al":{
"term":"<b>Al</b>exandre Dumas", "weight":0, "payload":""}

When I try it with a Hungarian author name containing a special character:
"Jó":"author":{
"Jó":{ "numFound":0, "suggestions":[]}}

When I try it with three letters, it works again:
"Józ":"author":{
"Józ":{ "numFound":10, "suggestions":[
{ "term":"Bajza <b>Józ</b>sef", "weight":0, "payload":""},
{ "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
{ "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
{ "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
{ "term":"<b>Józ</b>sef Attila", "weight":0, "payload":""}..
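
To reproduce the three requests above, something like the following could be
used. This is only a minimal sketch: the host, port and core name ("books")
are assumptions and not taken from my setup, and the helper name suggest_url
is made up for illustration. The handler path comes from the config above and
suggest.q is the standard suggester query parameter. It only builds the URLs,
so that the non-ASCII prefix is sent percent-encoded as UTF-8.

```python
from urllib.parse import urlencode

# Hypothetical base URL: host, port and core name are assumptions.
SOLR_BASE = "http://localhost:8983/solr/books"


def suggest_url(prefix, base=SOLR_BASE):
    """Build a /suggesthandler request URL for the given prefix."""
    params = {"suggest.q": prefix, "wt": "json"}
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes.
    return f"{base}/suggesthandler?{urlencode(params)}"


# The three prefixes tried in this mail.
for prefix in ("Al", "Jó", "Józ"):
    print(suggest_url(prefix))
```

Sending the URLs with any HTTP client should give the responses shown above,
provided the suggester has been built first (buildOnStartup and buildOnCommit
are both false in my config).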

Any idea how a longer string can have more matches than a shorter one? It is
inconsistent. What can I do to fix it? It would result in a poor customer
experience: users would feel that sometimes they need 2 and sometimes 3
characters to get suggestions.

Thanks in advance,
Roland
