Re: Re: Re: Multi-lingual Search & Accent Marks

Alexandre Rafalovitch Tue, 03 Sep 2019 13:27:54 -0700

What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check that it ignores keyword-marked tokens)
3) RemoveDuplicatesTokenFilterFactory


That may give what you are after without custom coding.
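
To illustrate the intended effect of the three-step chain above, here is a
minimal plain-Python simulation. It is not Solr or Lucene code: the function
names are hypothetical, and it assumes the folding step skips keyword-marked
tokens, which is exactly the behaviour step 2 says still needs checking.

```python
import unicodedata


def keyword_repeat(terms):
    """Emit every term twice at the same position: once flagged as keyword, once not."""
    for pos, term in enumerate(terms):
        yield (pos, term, True)   # keyword-marked copy, meant to stay untouched
        yield (pos, term, False)  # plain copy, free to be folded


def fold(term):
    """Strip combining marks after NFKD decomposition (a rough stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def folding_filter(stream):
    """Fold only tokens that are not keyword-marked (ASSUMPTION: the real filter honours the flag)."""
    for pos, term, is_keyword in stream:
        yield (pos, term if is_keyword else fold(term), is_keyword)


def remove_duplicates(stream):
    """Drop a token when the same text was already emitted at the same position."""
    seen = set()
    for pos, term, _ in stream:
        if (pos, term) not in seen:
            seen.add((pos, term))
            yield (pos, term)


def analyze(text):
    """Run the simulated chain over whitespace-separated terms."""
    return list(remove_duplicates(folding_filter(keyword_repeat(text.split()))))


print(analyze("caf\u00e9 menu"))
```

In this sketch only terms that actually change under folding end up with two
forms at one position; unaccented terms collapse back to a single token.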

Regards,
   Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld -
audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
> Toke,
>
> Thank you! That makes a lot of sense.
>
> In other news -- we just had a meeting where we decided to try out a hybrid 
> strategy. I'd love to know what you & everyone else think...
>
> - Since we are concerned with the overhead created by "double-fielding" all 
> tokens per language (because I'm not sure how we'd work the logic into Solr 
> to only double-field when an accent is present), we are going to try 
> something along the lines of synonym expansion:
>         - We are going to build a custom plugin that detects diacritics -- 
> upon detection, the plugin would expand the token to both its original form 
> and its ASCII-folded form (a la Toke's approach).
>         - However, since we are doing it in a way that mimics synonym 
> expansion, we are going to keep both terms in a single field.
>
> The main issue we anticipate with the above strategy is scoring. Since 
> accented terms would be indexed in two forms, their term frequency would 
> increase, which might bias our page ranker...
>
> Has anyone done anything similar? (And/or does anyone think this idea is 
> totally the wrong way to go?)
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 9/3/19, 2:58 PM, "Toke Eskildsen" <t...@kb.dk> wrote:
>
>     Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>     > Do you find that searching over both the original title field and the
>     > normalized title field increases the time it takes for your search
>     > engine to retrieve results?
>
>     It is not something we have measured as that index is fast enough (which 
> in this context means that we're practically always waiting for the result 
> from an external service that is issued in parallel with the call to our Solr 
> server).
>
>     Technically it's not different from searching across other fields defined 
> in the eDismax setup, so I guess it boils down to "how many fields can you 
> afford to search across?", where our organization's default answer is "as 
> many as we need to get quality matches. Make it work Toke, chop chop". On a 
> more serious note, it is not something I would worry about unless we're 
> talking some special high-performance setup with a budget for tuning: 
> Matching terms and joining filters is core Solr (Lucene really) 
> functionality. Plain query & filter-matching time tends to be dwarfed by 
> aggregations (grouping, faceting, stats).
>
>     - Toke Eskildsen
>
>
