Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - audrey.lorberf...@ibm.com Wed, 09 Oct 2019 06:39:02 -0700

Hey Alex,

Thank you!


Re: stopwords being a thing of the past due to the affordability of 
hardware...can you expand? I'm not sure I understand.

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:

    Another thing to add to the above,
    >
    > IT:ibm. In this case, we would want to maintain the colon and the
    > capitalization (otherwise “it” would be taken out as a stopword).
    >
    Stopwords are a thing of the past at this point. There is no benefit to
    using them now that hardware is so cheap.
    
    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
    wrote:
    
    > If you don't want it to be touched by a tokenizer, how would the
    > protection step know that the sequence of characters you want to
    > protect is "IT:ibm" and not "this is an IT:ibm term I want to
    > protect"?
    >
    > It sounds to me like you may want to:
    > 1) copyField to a second field
    > 2) Apply a much lighter (whitespace?) tokenizer to that second field
    > 3) Run the results through something like KeepWordFilterFactory
    > 4) Search both fields with a boost on the second, higher-signal field
    >
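The four steps above can be sketched in plain Python to show the idea. This is only an illustration under stated assumptions, not Solr configuration: the stopword set, the keep-word list, the boost value, and the function names are all hypothetical, and the "main" analyzer is a rough stand-in for whatever chain the primary field actually uses.

```python
import re

# Hypothetical stopword list and protected-term list (assumptions for this sketch).
STOPWORDS = {"a", "an", "i", "is", "it", "the", "this", "to"}
KEEP_WORDS = {"IT:ibm"}


def analyze_main(text):
    """Stand-in for the main field: split on non-alphanumerics,
    lowercase, and drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


def analyze_protected(text):
    """Stand-in for the second (copied) field: whitespace tokenizer,
    then a keep-word filter that passes only listed terms, unchanged."""
    return [t for t in text.split() if t in KEEP_WORDS]


def score(query, doc, boost=5.0):
    """Toy scoring: count matching terms in each field and weight
    matches in the protected field more heavily."""
    main_hits = len(set(analyze_main(query)) & set(analyze_main(doc)))
    protected_hits = len(set(analyze_protected(query)) & set(analyze_protected(doc)))
    return main_hits + boost * protected_hits


doc = "this is an IT:ibm term I want to protect"
print(analyze_main(doc))       # colon and case are lost, "it" is dropped
print(analyze_protected(doc))  # only the protected acronym survives
print(score("IT:ibm", doc), score("it ibm", doc))
```

Note that a pure whitespace split leaves attached punctuation in place, so "IT:ibm," with a trailing comma would not match the keep-word list in this sketch.
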
    > The other option is to run a CharFilter
    > (PatternReplaceCharFilterFactory), which runs before the tokenizer, to
    > map known complex acronyms to non-tokenizable substitutions, e.g.
    > "IT:ibm" -> "term365". As long as it is done at both index and query
    > time, they will still match. You may need a bunch of them, or some
    > sort of lookup map.
    >
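The substitution idea can also be sketched in plain Python. Again, this is a hedged illustration rather than Solr configuration: the "IT:ibm" -> "term365" pair comes from the message above, while the stopword set, the lookup map structure, and the function names are assumptions made for the example. The same filter runs on both the indexed text and the query, which is what keeps them matching.

```python
import re

# Hypothetical stopword list (assumption for this sketch).
STOPWORDS = {"a", "an", "i", "is", "it", "the", "this", "to"}

# Lookup map of protected terms to tokenizer-safe placeholders.
# The single entry is the example from the thread.
PROTECTED = {"IT:ibm": "term365"}

# One case-sensitive pattern over all keys, longest first so that a longer
# protected term wins over a shorter one that is its prefix.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)


def char_filter(text):
    """Runs before tokenization: swap each protected term for its placeholder."""
    return _PATTERN.sub(lambda m: PROTECTED[m.group(0)], text)


def analyze(text):
    """Char filter first, then a rough standard-style chain:
    split on non-alphanumerics, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", char_filter(text))
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


indexed = analyze("this is an IT:ibm term I want to protect")
query = analyze("IT:ibm")
print(indexed, query)
print(set(query) <= set(indexed))  # the placeholder matches on both sides
```

This sketch matches the protected string anywhere in the text, including inside a longer word; a real pattern would likely want boundary checks around each term.
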
    > Regards,
    >    Alex.
    >
    > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
    > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    > >
    > > Hi All,
    > >
    > > This is likely a rudimentary question, but I can’t seem to find a
    > > straightforward answer on forums or in the documentation… is there a
    > > way to protect tokens from ANY analysis? I know things like the
    > > KeywordMarkerFilterFactory protect tokens from stemming, but we have
    > > some terms we don’t even want our tokenizer to touch. Mostly, these are
    > > IBM-specific acronyms, such as IT:ibm. In this case, we would want to
    > > maintain the colon and the capitalization (otherwise “it” would be
    > > taken out as a stopword).
    > >
    > > Any advice is appreciated!
    > >
    > > Thank you,
    > > Audrey
    > >
    > > --
    > > Audrey Lorberfeld
    > > Data Scientist, w3 Search
    > > IBM
    > > audrey.lorberf...@ibm.com
    > >
    >
