If you don't want it to be touched by a tokenizer, how would the protection step know that the sequence of characters you want to protect is "IT:ibm" and not "this is an IT:ibm term I want to protect"?
It sounds to me like you may want to:
1) copyField to a second field
2) Apply a much lighter (whitespace?) tokenizer to that second field
3) Run the results through something like KeepWordFilterFactory
4) Search both fields with a boost on the second, higher-signal field

The other option is to run a CharFilter (PatternReplaceCharFilterFactory), which is applied before the tokenizer, to map known complex acronyms to substitutions the tokenizer will not split, e.g. "IT:ibm" -> "term365". As long as it is done at both index and query time, they will still match. You may need a bunch of them, or some sort of lookup map.

Regards,
   Alex.

On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
> Hi All,
>
> This is likely a rudimentary question, but I can’t seem to find a
> straight-forward answer on forums or the documentation…is there a way to
> protect tokens from ANY analysis? I know things like the
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> terms we don’t even want our tokenizer to touch. Mostly, these are
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> maintain the colon and the capitalization (otherwise “it” would be taken out
> as a stopword).
>
> Any advice is appreciated!
>
> Thank you,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
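
P.S. In case it helps, here is roughly how the two approaches from my reply (the copyField route and the char filter route) might look in the schema. Treat it as an untested sketch, not a drop-in config: the field and type names (body, body_acronyms, text_acronyms, text_protected), the file names (protected_terms.txt, stopwords.txt) and the "term365" placeholder are all made up for illustration. The factory classes themselves are standard Solr ones.

```xml
<!-- Option 1: second field with a light analysis chain (hypothetical names). -->
<fieldType name="text_acronyms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so the colon and the case in IT:ibm survive. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Keep only tokens listed in protected_terms.txt (one term per line, e.g. IT:ibm). -->
    <filter class="solr.KeepWordFilterFactory" words="protected_terms.txt" ignoreCase="false"/>
  </analyzer>
</fieldType>

<field name="body_acronyms" type="text_acronyms" indexed="true" stored="false"/>
<!-- Assumes an existing "body" field that uses your normal text analysis. -->
<copyField source="body" dest="body_acronyms"/>

<!-- Option 2: rewrite the acronym before the tokenizer ever sees it. -->
<fieldType name="text_protected" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element applies at both index and query time. -->
  <analyzer>
    <!-- Case-sensitive Java regex; "term365" is an arbitrary placeholder token. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\bIT:ibm\b"
                replacement="term365"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

For option 1, the query side could then be something like defType=edismax with qf=body body_acronyms^5 (the boost value is a guess you would need to tune). Note that a whitespace tokenizer leaves adjacent punctuation attached, so "IT:ibm," with a trailing comma would not match the keep list as written. For option 2, if the list of acronyms grows, solr.MappingCharFilterFactory with a mapping file may be tidier than a long chain of regex char filters.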