Stopword removal dates from when we were running search engines on 16-bit 
computers with 50-megabyte disks, like the PDP-11. Dropping those terms avoided 
storing and processing long posting lists.

Think of removing stopwords as a binary weighting on frequent terms: either on, 
or off (not in the index). With idf, frequent terms instead get a continuously 
scaled weight, low but nonzero. That gives better results than binary weighting.
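The difference can be sketched with a toy example (a made-up corpus and the 
textbook idf formula, not Solr's exact scoring):

```python
import math

# Toy corpus: document frequencies for a few terms, out of N = 1000 docs.
N = 1000
doc_freq = {"the": 990, "be": 850, "search": 120, "pdp-11": 3}

def idf(df, n=N):
    # Inverse document frequency: rare terms score high,
    # very frequent terms score low but never exactly zero.
    return math.log(n / df)

def binary_weight(df, n=N, cutoff=0.5):
    # Stopword removal approximated as a hard cutoff:
    # terms appearing in more than half the documents get weight 0.
    return 0.0 if df / n > cutoff else 1.0

for term, df in doc_freq.items():
    print(f"{term:8s} idf={idf(df):6.3f}  binary={binary_weight(df)}")
```

With idf, "the" still contributes a tiny amount to the score; with the binary 
cutoff it vanishes entirely, which is what breaks all-stopword queries.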

Removing stopwords makes some searches impossible. The classic example is “to 
be or not to be”, which is 100% stopwords. This is a real-world problem: when I 
was building search for Netflix a dozen years ago, I hit several movie and TV 
titles that were made up entirely of stopwords. I wrote about them in this blog 
post:

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2019, at 6:38 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
> <audrey.lorberf...@ibm.com> wrote:
> 
> Hey Alex,
> 
> Thank you!
> 
> Re: stopwords being a thing of the past due to the affordability of 
> hardware...can you expand? I'm not sure I understand.
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
> 
>    Another thing to add to the above: stopwords are a thing of the past at
>    this point. There is no benefit to using them now with hardware being so
>    cheap.
> 
>    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
>    wrote:
> 
>> If you don't want it to be touched by a tokenizer, how would the
>> protection step know that the sequence of characters you want to
>> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>> protect"?
>> 
>> It sounds to me like you may want to:
>> 1) copyField to a second field
>> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>> 3) Run the results through something like KeepWordFilterFactory
>> 4) Search both fields with a boost on the second, higher-signal field
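>> 
>> A rough schema sketch of steps 1-3 (field and type names here are made
>> up for illustration, not from this thread):
>> 
>> ```xml
>> <!-- 1) Copy the source field into a second, acronym-only field -->
>> <copyField source="title" dest="title_acronyms"/>
>> 
>> <!-- 2) + 3) Whitespace tokenization, then keep only listed terms -->
>> <fieldType name="acronyms" class="solr.TextField">
>>   <analyzer>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.KeepWordFilterFactory" words="acronyms.txt"/>
>>   </analyzer>
>> </fieldType>
>> ```
>> 
>> With acronyms.txt containing entries like "IT:ibm", that field would
>> hold only the protected terms, untouched by stemming or lowercasing.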
>> 
>> The other option is to run a CharFilter
>> (PatternReplaceCharFilterFactory), which runs before the tokenizer, to
>> map known complex acronyms to non-tokenizable substitutions, e.g.
>> "IT:ibm -> term365". As long as the mapping is applied at both indexing
>> and query time, the terms will still match. You may need a bunch of
>> these, or some sort of lookup map.
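>> 
>> For the CharFilter route, something like this (the pattern and
>> replacement token are illustrative):
>> 
>> ```xml
>> <analyzer>
>>   <!-- Rewrite the acronym before tokenization so the colon and case
>>        never reach the tokenizer; use the same analyzer for index
>>        and query so the substituted terms match. -->
>>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>>               pattern="IT:ibm" replacement="term365"/>
>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>> </analyzer>
>> ```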
>> 
>> Regards,
>>   Alex.
>> 
>> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> This is likely a rudimentary question, but I can’t seem to find a
>>> straightforward answer on forums or in the documentation… is there a way
>>> to protect tokens from ANY analysis? I know things like the
>>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>>> terms we don’t even want our tokenizer to touch. Mostly, these are
>>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>>> maintain the colon and the capitalization (otherwise “it” would be taken
>>> out as a stopword).
>>> 
>>> Any advice is appreciated!
>>> 
>>> Thank you,
>>> Audrey
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>> 
> 
> 
