Re: Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - audrey.lorberf...@ibm.com Wed, 09 Oct 2019 11:31:45 -0700

Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop words?


-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com> wrote:

    However, with all that said, stopwords CAN be useful in some situations.  I
    combine stopwords with the shingle factory to create "interesting phrases"
    (not really) that i use in "my more like this" needs.  for example,
    europe for vacation
    europe on vacation
    will create the shingle
    europe_vacation
    which i can then use to relate other documents that would be much
    more similar in such regard, rather than just using the "interesting words"
    europe, vacation
    
    with stop words, the shingles would be
    europe_for
    for_vacation
    and
    europe_on
    on_vacation
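
A minimal sketch of the shingle example above, in plain Python rather than Solr's own stop and shingle filter factories. The stopword list and the function names are assumptions made for illustration, and the sketch ignores details of a real analysis chain, such as how position gaps left by removed stopwords are handled.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"for", "on", "the", "a", "to"}

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def shingles(tokens, size=2, sep="_"):
    """Join every run of `size` adjacent tokens into one shingle."""
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def phrase_shingles(text, remove_stopwords=True):
    """Tokenize, optionally drop stopwords, then build two-word shingles."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return shingles(tokens)

print(phrase_shingles("europe for vacation"))  # ['europe_vacation']
print(phrase_shingles("europe on vacation"))   # ['europe_vacation']
print(phrase_shingles("europe for vacation", remove_stopwords=False))
# ['europe_for', 'for_vacation']
```

With the stopwords removed first, both phrasings collapse to the same shingle, which is what lets the two documents be related to each other.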
    
    just something to keep in mind, there's a lot of creative ways to use
    stopwords depending on your needs.  i use the above for a VERY basic ML
    teacher and it works way better than using stopwords.
    
    On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerick...@gmail.com>
    wrote:
    
    > The theory behind stopwords is that they are “safe” to remove when
    > calculating relevance, so we can squeeze every last bit of usefulness out
    > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
    > come a long way since then and the necessity of removing stopwords from the
    > indexed tokens to conserve RAM and disk is much less relevant than it used
    > to be in “the bad old days” when the idea of stopwords was invented.
    >
    > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
    > totally agree that you should remove stopwords only _after_ you have some
    > evidence that removing them is A Good Thing in your situation.
    >
    > And removing stopwords leads to some interesting corner cases. Consider a
    > search for “to be or not to be” if they’re all stopwords.
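
The corner case above can be reproduced with a toy analyzer. This is a sketch in plain Python, not Solr's analysis chain; the stopword list and the `analyze` function are assumptions made for illustration.

```python
# Hypothetical stopword list that happens to contain every word of the query.
STOPWORDS = {"to", "be", "or", "not"}

def analyze(text, remove_stopwords=True):
    """Lowercase, split on whitespace, and optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(analyze("to be or not to be"))  # []
print(analyze("to be or not to be", remove_stopwords=False))
# ['to', 'be', 'or', 'not', 'to', 'be']
```

Once every term has been filtered out, the query has nothing left to match on.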
    >
    > Best,
    > Erick
    >
    > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
    > > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    > >
    > > Hey Alex,
    > >
    > > Thank you!
    > >
    > > Re: stopwords being a thing of the past due to the affordability of
    > > hardware...can you expand? I'm not sure I understand.
    > >
    > > --
    > > Audrey Lorberfeld
    > > Data Scientist, w3 Search
    > > IBM
    > > audrey.lorberf...@ibm.com
    > >
    > >
    > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com>
    > > wrote:
    > >
    > >    Another thing to add to the above,
    > >>
    > >> IT:ibm. In this case, we would want to maintain the colon and the
    > >> capitalization (otherwise “it” would be taken out as a stopword).
    > >>
    > >    stopwords are a thing of the past at this point.  there is no benefit to
    > >    using them now with hardware being so cheap.
    > >
    > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
    > >    arafa...@gmail.com> wrote:
    > >
    > >> If you don't want it to be touched by a tokenizer, how would the
    > >> protection step know that the sequence of characters you want to
    > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
    > >> protect"?
    > >>
    > >> What it sounds to me is that you may want to:
    > >> 1) copyField to a second field
    > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
    > >> 3) Run the results through something like KeepWordFilterFactory
    > >> 4) Search both fields with a boost on the second, higher-signal field
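
The four steps above could be sketched roughly as follows. This is a plain-Python simulation, not Solr configuration: the field names `body` and `body_exact`, the stopword and keep-word lists, the boost value, and the overlap-count scoring are all assumptions made for illustration.

```python
import re

STOPWORDS = {"it", "is", "an", "the", "this"}  # hypothetical stopword list
KEEP_WORDS = {"IT:ibm"}                        # hypothetical protected terms

def heavy_analyze(text):
    """Main field: lowercase, split on punctuation, drop stopwords."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    return [t for t in tokens if t not in STOPWORDS]

def light_analyze(text):
    """Second field: whitespace split, then keep only the protected terms."""
    return [t for t in text.split() if t in KEEP_WORDS]

def index(doc):
    """Steps 1-3: copy the text into a second, lightly analyzed field."""
    return {"body": heavy_analyze(doc), "body_exact": light_analyze(doc)}

def score(query, indexed, boost=5.0):
    """Step 4: search both fields, boosting matches on the exact field."""
    general = len(set(heavy_analyze(query)) & set(indexed["body"]))
    exact = len(set(light_analyze(query)) & set(indexed["body_exact"]))
    return general + boost * exact

docs = ["this is an IT:ibm term", "ibm is an IT company"]
indexed = [index(d) for d in docs]
scores = [score("IT:ibm", d) for d in indexed]
print(scores)  # [6.0, 1.0]
```

Both documents match on "ibm" in the main field, but only the first carries the protected term in the second field, so it ranks higher. Note that a whitespace split would miss a term with trailing punctuation such as "IT:ibm,".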
    > >>
    > >> The other option is to run CharacterFilter,
    > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
    > >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
    > >> term365". As long as it is done on both indexing and query, they will
    > >> still match. You may have to have a bunch of them or write some sort
    > >> of lookup map.
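
The pre-tokenizer substitution described above might look like this in outline. Again this is a plain-Python sketch rather than an actual PatternReplaceCharFilterFactory setup: the lookup map reuses the "IT:ibm -> term365" example from the message, while the stopword list and function names are assumptions made for illustration.

```python
import re

STOPWORDS = {"it", "is", "an", "the", "this"}  # hypothetical stopword list

# Lookup map from protected terms to substitutions the tokenizer will not split.
PROTECTED = {"IT:ibm": "term365"}

def char_filter(text):
    """Runs before tokenization: replace each protected term (case-sensitive)."""
    for term, placeholder in PROTECTED.items():
        text = text.replace(term, placeholder)
    return text

def analyze(text):
    """Same chain at index time and query time, so substitutions still match."""
    text = char_filter(text)
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("this is an IT:ibm term"))  # ['term365', 'term']
print(analyze("IT:ibm"))                  # ['term365']
print(analyze("it is an ibm term"))       # ['ibm', 'term']
```

Because the same function runs on documents and on queries, a search for "IT:ibm" becomes a search for "term365", while an ordinary lowercase "it" is still removed as a stopword.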
    > >>
    > >> Regards,
    > >>   Alex.
    > >>
    > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
    > >> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    > >>>
    > >>> Hi All,
    > >>>
    > >>> This is likely a rudimentary question, but I can’t seem to find a
    > >> straight-forward answer on forums or the documentation…is there a way to
    > >> protect tokens from ANY analysis? I know things like the
    > >> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
    > >> terms we don’t even want our tokenizer to touch. Mostly, these are
    > >> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
    > >> maintain the colon and the capitalization (otherwise “it” would be taken
    > >> out as a stopword).
    > >>>
    > >>> Any advice is appreciated!
    > >>>
    > >>> Thank you,
    > >>> Audrey
    > >>>
    > >>> --
    > >>> Audrey Lorberfeld
    > >>> Data Scientist, w3 Search
    > >>> IBM
    > >>> audrey.lorberf...@ibm.com
    > >>>
    > >>
    > >
    > >
    >
    >
