Re: Protecting Tokens from Any Analysis

Alexandre Rafalovitch Tue, 08 Oct 2019 09:43:51 -0700

If you don't want it to be touched by a tokenizer, how would the
protection step know that the sequence of characters you want to
protect is "IT:ibm" and not "this is an IT:ibm term I want to
protect"?


It sounds to me like you may want to:
1) copyField to a second field
2) Apply a much lighter (whitespace?) tokenizer to that second field
3) Run the results through something like KeepWordFilterFactory
4) Search both fields with a boost on the second, higher-signal field
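
To make the four steps concrete, here is a rough sketch in plain Python
rather than Solr configuration. Everything in it is an assumption for
illustration: the PROTECTED set stands in for the keep-words file, the
stopword list and the boost of 5.0 are made up, and the scoring is just a
count of overlapping tokens, not how Solr actually scores.

```python
import re

# Hypothetical keep-words list (step 3) and stopword list, for illustration only.
PROTECTED = {"IT:ibm"}
STOPWORDS = {"it", "is", "an", "this", "i", "to"}

def analyze_main(text):
    """Stand-in for the normal chain: split on non-alphanumerics,
    lowercase, drop stopwords. This is what destroys "IT:ibm"."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in STOPWORDS]

def analyze_protected(text):
    """Steps 2-3: split on whitespace only, then keep just the protected terms."""
    return [t for t in text.split() if t in PROTECTED]

def score(query, doc, boost=5.0):
    """Step 4: match against both fields, weighting the protected field higher."""
    main_hits = len(set(analyze_main(query)) & set(analyze_main(doc)))
    protected_hits = len(set(analyze_protected(query)) & set(analyze_protected(doc)))
    return main_hits + boost * protected_hits

doc = "this is an IT:ibm term I want to protect"
print(analyze_main(doc))       # the main field loses "IT" as a stopword
print(analyze_protected(doc))  # the second field keeps the acronym intact
print(score("IT:ibm", doc), score("ibm", doc))
```

Note that a whitespace-only split leaves punctuation attached, so "IT:ibm,"
would not match the keep list here without some extra cleanup.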

The other option is to run a CharFilter
(PatternReplaceCharFilterFactory), which runs before the tokenizer, to
map known complex acronyms to substitutions that the tokenizer will
not split, e.g. "IT:ibm" -> "term365". As long as it is done at both
index and query time, they will still match. You may need a bunch of
them, or some sort of lookup map.
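
The idea behind the substitution can be sketched in plain Python (again,
not Solr code). The SUBSTITUTIONS map, the stopword list and the toy
tokenizer are all assumptions for illustration; the point is only that the
same replacement runs before tokenization on both the indexed text and the
query.

```python
import re

# Hypothetical lookup map; "term365" is the placeholder used in the example above.
SUBSTITUTIONS = {"IT:ibm": "term365"}
STOPWORDS = {"it", "is", "an", "this"}

# Longest keys first so a longer acronym wins over any shorter prefix of it.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(SUBSTITUTIONS, key=len, reverse=True))
)

def char_filter(text):
    """Pre-tokenizer step: swap each known acronym for its placeholder."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)

def analyze(text):
    """Char filter first, then a toy tokenizer, lowercasing and stopword removal."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", char_filter(text)) if t]
    return [t for t in tokens if t not in STOPWORDS]

indexed = analyze("this is an IT:ibm term")
query = analyze("IT:ibm")
print(indexed, query)  # both sides contain the same placeholder token
```

Because the placeholder is plain alphanumerics, the rest of the chain has
nothing to split or drop, while a bare "it" is still removed as a stopword.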

Regards,
   Alex.

On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
> Hi All,
>
> This is likely a rudimentary question, but I can’t seem to find a 
> straight-forward answer on forums or the documentation…is there a way to 
> protect tokens from ANY analysis? I know things like the 
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some 
> terms we don’t even want our tokenizer to touch. Mostly, these are 
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to 
> maintain the colon and the capitalization (otherwise “it” would be taken out 
> as a stopword).
>
> Any advice is appreciated!
>
> Thank you,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
