Re: Problem with solr suggester in case of non-ASCII characters

Szűcs Roland Wed, 31 Jul 2019 05:39:21 -0700

Hi Erick,

Thanks for your advice.
I have already removed the stop word filter from the field definition used by
the suggester and it works great. I will consider removing it from the
processing of the other fields as well. I have only 7000 docs with an index
size of 18MB so far, so the memory footprint is not a key issue for me.
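
For anyone who finds this thread later, here is a minimal sketch of the kind of
field type I mean. It assumes the same analyzer chain as short_text_hu from my
first mail with only the StopFilterFactory lines dropped. The name
short_text_hu_without_stop_removal is the one from my earlier reply; the exact
chain in a real schema may differ.

```xml
<!-- Sketch only: assumed to be short_text_hu minus the StopFilterFactory lines. -->
<fieldType name="short_text_hu_without_stop_removal" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- No StopFilterFactory here, so short tokens such as "Jó" survive analysis. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- In the "author" suggester definition in solrconfig.xml: -->
<str name="suggestAnalyzerFieldType">short_text_hu_without_stop_removal</str>
```

Since buildOnStartup and buildOnCommit are both false in my config, the
suggester has to be rebuilt (for example by sending suggest.build=true with a
request) before a change like this shows up in the suggestions.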


Best,
Roland

Erick Erickson <erickerick...@gmail.com> wrote (on Wed, 31 Jul 2019, 14:24):

> Roland:
>
> Have you considered just not using stopwords anywhere? Largely they’re a
> holdover from a long time ago when every byte counted. Plus using stopwords
> has “interesting” issues with things like highlighting and phrase queries
> and the like.
>
> Sure, not using stopwords will make your index larger, but so will a
> copyfield…
>
> Your call of course, but stopwords are over-used IMO.
>
> I’m stealing Walter Underwood’s thunder here ;)
>
> Best,
> Erick
>
> > On Jul 30, 2019, at 2:11 PM, Szűcs Roland <szucs.rol...@bookandwalk.hu> wrote:
> >
> > Hi Furkan,
> >
> > Thanks for the suggestion. I always forget the most effective debugging
> > tool, the Analysis page.
> >
> > It turned out that "Jó" was a stop word and it was eliminated during the
> > text analysis. What I will do is create a new field type without stop
> > word removal and use it like this:
> > <str name="suggestAnalyzerFieldType">short_text_hu_without_stop_removal</str>
> >
> > Thanks again
> >
> > Roland
> >
> > Furkan KAMACI <furkankam...@gmail.com> wrote (on Tue, 30 Jul 2019, 16:17):
> >
> >> Hi Roland,
> >>
> >> Could you check the Analysis tab (
> >> https://lucene.apache.org/solr/guide/8_1/analysis-screen.html) and tell us
> >> how the term is analyzed for both query and index?
> >>
> >> Kind Regards,
> >> Furkan KAMACI
> >>
> >> On Tue, Jul 30, 2019 at 4:50 PM Szűcs Roland <szucs.rol...@bookandwalk.hu> wrote:
> >>
> >>> Hi All,
> >>>
> >>> I have an author suggester (search component and the related request
> >>> handler) defined in solrconfig.xml:
> >>> <searchComponent name="suggest" class="solr.SuggestComponent">
> >>>    <!-- All suggester components must have different file paths to avoid
> >>>    write lock issues -->
> >>>    <lst name="suggester">
> >>>      <str name="name">author</str>
> >>>      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
> >>>      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
> >>>      <str name="field">BOOK_productAuthor</str>
> >>>      <str name="suggestAnalyzerFieldType">short_text_hu</str>
> >>>      <str name="indexPath">suggester_infix_author</str>
> >>>      <str name="buildOnStartup">false</str>
> >>>      <str name="buildOnCommit">false</str>
> >>>      <str name="minPrefixChars">2</str>
> >>>    </lst>
> >>> </searchComponent>
> >>>
> >>> <requestHandler name="/suggesthandler" class="solr.SearchHandler"
> >>> startup="lazy" >
> >>> <lst name="defaults">
> >>>  <str name="suggest">true</str>
> >>>  <str name="suggest.count">10</str>
> >>>  <str name="suggest.dictionary">author</str>
> >>> </lst>
> >>> <arr name="components">
> >>>  <str>suggest</str>
> >>> </arr>
> >>> </requestHandler>
> >>>
> >>> The author field has only minimal text processing at query and index
> >>> time, based on the following definition:
> >>> <fieldType name="short_text_hu" class="solr.TextField"
> >>> positionIncrementGap="100" multiValued="true">
> >>>    <analyzer type="index">
> >>>      <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>>      <tokenizer class="solr.ClassicTokenizerFactory"/>
> >>>      <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
> >>> ignoreCase="true"/>
> >>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>    </analyzer>
> >>>    <analyzer type="query">
> >>>      <tokenizer class="solr.ClassicTokenizerFactory"/>
> >>>      <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
> >>> ignoreCase="true"/>
> >>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>    </analyzer>
> >>>  </fieldType>
> >>>  <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> >>> docValues="true"/>
> >>>  <fieldType name="strings" class="solr.StrField" sortMissingLast="true"
> >>> docValues="true" multiValued="true"/>
> >>>  <fieldType name="text_ar" class="solr.TextField"
> >>> positionIncrementGap="100">
> >>>    <analyzer>
> >>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>      <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt"
> >>> ignoreCase="true"/>
> >>>      <filter class="solr.ArabicNormalizationFilterFactory"/>
> >>>      <filter class="solr.ArabicStemFilterFactory"/>
> >>>    </analyzer>
> >>>  </fieldType>
> >>>
> >>> When I use queries with only ASCII characters, the results are correct:
> >>> "Al":{
> >>> "term":"<b>Al</b>exandre Dumas", "weight":0, "payload":""}
> >>>
> >>> When I try it with a Hungarian author name containing a special character:
> >>> "Jó":"author":{
> >>> "Jó":{ "numFound":0, "suggestions":[]}}
> >>>
> >>> When I try it with three letters, it works again:
> >>> "Józ":"author":{
> >>> "Józ":{ "numFound":10, "suggestions":[
> >>> { "term":"Bajza <b>Józ</b>sef", "weight":0, "payload":""},
> >>> { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
> >>> { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
> >>> { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
> >>> { "term":"<b>Józ</b>sef Attila", "weight":0, "payload":""}..
> >>>
> >>> Any idea how a longer string can have more matches than a shorter one?
> >>> It is inconsistent. What can I do to fix it? It would result in a poor
> >>> customer experience: users would feel that they sometimes need 2 and
> >>> sometimes 3 characters to get suggestions.
> >>>
> >>> Thanks in advance,
> >>> Roland
> >>>
> >>
>
>
