What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check that it ignores keyword-marked tokens)
3) RemoveDuplicatesTokenFilterFactory
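
As a rough illustration of what that chain would emit, here is a small plain-Python simulation. It is not Solr or Lucene code: the `Token` class and the `keyword_repeat`, `fold`, `remove_duplicates` and `analyze` functions are hypothetical stand-ins for the three steps, the folding is a simple Unicode decomposition rather than Solr's folding table, and it assumes the folding step skips keyword-marked tokens, which is exactly the point that still needs checking.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False  # stand-in for a "keyword" marker on the token


def keyword_repeat(tokens):
    """Step 1: emit each token twice, first marked as keyword, then unmarked."""
    for tok in tokens:
        yield Token(tok.text, tok.position, keyword=True)
        yield Token(tok.text, tok.position, keyword=False)


def fold(tokens):
    """Step 2: strip diacritics, ASSUMING keyword-marked tokens are left alone."""
    for tok in tokens:
        if tok.keyword:
            yield tok
            continue
        decomposed = unicodedata.normalize("NFKD", tok.text)
        folded = "".join(c for c in decomposed if not unicodedata.combining(c))
        yield Token(folded, tok.position, tok.keyword)


def remove_duplicates(tokens):
    """Step 3: drop a token if the same text was already seen at that position."""
    seen = set()
    for tok in tokens:
        key = (tok.position, tok.text)
        if key not in seen:
            seen.add(key)
            yield tok


def analyze(text):
    """Whitespace-tokenize, lowercase, then run the three steps in order."""
    tokens = [Token(word, pos) for pos, word in enumerate(text.lower().split())]
    chain = remove_duplicates(fold(keyword_repeat(tokens)))
    return [(tok.position, tok.text) for tok in chain]


if __name__ == "__main__":
    print(analyze("Caf\u00e9 menu"))
```

In this sketch an accented word ends up as two tokens at the same position (original and folded), while an unaccented word collapses back to a single token, so only terms with diacritics are expanded.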

That may give what you are after without custom coding.

Regards,
   Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
> Toke,
>
> Thank you! That makes a lot of sense.
>
> In other news -- we just had a meeting where we decided to try out a hybrid
> strategy. I'd love to know what you & everyone else thinks...
>
> - Since we are concerned with the overhead created by "double-fielding" all
>   tokens per language (because I'm not sure how we'd work the logic into Solr
>   to only double-field when an accent is present), we are going to try to do
>   something along the lines of synonym-expansion:
>     - We are going to build a custom plugin that detects diacritics --
>       upon detection, the plugin would expand the token to both its original
>       form and its ascii-folded term (a la Toke's approach).
>     - However, since we are doing it in a way that mimics synonym
>       expansion, we are going to keep both terms in a single field
>
> The main issue we are anticipating with the above strategy surrounds scoring.
> Since we will be increasing the frequency of accented terms, we might bias
> our page ranker...
>
> Has anyone done anything similar (and/or does anyone think this idea is
> totally the wrong way to go?)
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 9/3/19, 2:58 PM, "Toke Eskildsen" <t...@kb.dk> wrote:
>
>     Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>     > Do you find that searching over both the original title field and the
>     > normalized title field increases the time it takes for your search
>     > engine to retrieve results?
>
>     It is not something we have measured as that index is fast enough (which
>     in this context means that we're practically always waiting for the result
>     from an external service that is issued in parallel with the call to our
>     Solr server).
>
>     Technically it's not different from searching across other fields defined
>     in the eDismax setup, so I guess it boils down to "how many fields can you
>     afford to search across?", where our organization's default answer is "as
>     many as we need to get quality matches. Make it work Toke, chop chop". On a
>     more serious note, it is not something I would worry about unless we're
>     talking some special high-performance setup with a budget for tuning:
>     Matching terms and joining filters is core Solr (Lucene really)
>     functionality. Plain query & filter-matching time tend to be dwarfed by
>     aggregations (grouping, faceting, stats).
>
>     - Toke Eskildsen