Re: Re: Re: Re: Protecting Tokens from Any Analysis

David Hastings Wed, 09 Oct 2019 12:11:09 -0700

if you have anything close to a decent server you won't notice it at all.  i'm
at about 21 million documents, index varies between 450gb and 800gb
depending on merges, and about 60k searches a day, and stay sub-second
nonstop, and this is on a single core/non-cloud environment


On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
<audrey.lorberf...@ibm.com> wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop list would take a toll on the system. For instance,
> right now we have "ibm" as a stop word because it appears everywhere in our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
>     Wow, thank you so much, everyone. This is all incredibly helpful
> insight.
>
>     So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
>     --
>     Audrey Lorberfeld
>     Data Scientist, w3 Search
>     IBM
>     audrey.lorberf...@ibm.com
>
>
>     On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com>
> wrote:
>
>         However, with all that said, stopwords CAN be useful in some
> situations.  I
>         combine stopwords with the shingle factory to create "interesting
> phrases"
>         (not really) that i use in my "more like this" needs.  for example,
>         europe for vacation
>         europe on vacation
>         will create the shingle
>         europe_vacation
>         which i can then use to relate other documents that would be much
>         more similar in such regard, rather than just using the
> "interesting words"
>         europe, vacation
>
>         with stop words, the shingles would be
>         europe_for
>         for_vacation
>         and
>         europe_on
>         on_vacation
>
>         just something to keep in mind,  theres a lot of creative ways to
> use
>         stopwords depending on your needs.  i use the above for a VERY
> basic ML
>         teacher and it works way better than using stopwords,
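The two sets of shingles described above can be reproduced with a small Python sketch. This is only an illustration, not Solr's shingle filter: the stop list, the `shingles` helper, and the `_` separator handling are assumptions made for the example. In Solr itself the shingle filter may also insert a filler token where a stopword was removed, depending on configuration, which this sketch ignores.

```python
# Assumed stop list, chosen only to match the example phrases in the thread.
STOPWORDS = frozenset({"for", "on"})


def shingles(text, stopwords=frozenset(), size=2, sep="_"):
    """Return word n-grams of `size` tokens, built after dropping stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


for phrase in ("europe for vacation", "europe on vacation"):
    # First list: stopwords removed before shingling; second: stopwords kept.
    print(phrase, "->", shingles(phrase, STOPWORDS), "vs", shingles(phrase))
```

With the stopwords removed, both phrases collapse to the same shingle, which is what lets the two documents be related to each other.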
>
>         On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
>         wrote:
>
>         > The theory behind stopwords is that they are “safe” to remove
> when
>         > calculating relevance, so we can squeeze every last bit of
> usefulness out
>         > of very constrained hardware (think 64K of memory. Yes
> kilobytes). We’ve
>         > come a long way since then and the necessity of removing
> stopwords from the
>         > indexed tokens to conserve RAM and disk is much less relevant
> than it used
>         > to be in “the bad old days” when the idea of stopwords was
> invented.
>         >
>         > I’m not quite so confident as Alex that there is “no benefit”,
> but I’ll
>         > totally agree that you should remove stopwords only _after_ you
> have some
>         > evidence that removing them is A Good Thing in your situation.
>         >
>         > And removing stopwords leads to some interesting corner cases.
> Consider a
>         > search for “to be or not to be” if they’re all stopwords.
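The corner case is easy to see in a toy Python sketch. The stop list and the `analyze` helper are hypothetical and stand in for a real analysis chain; the point is only that a query made entirely of stopwords leaves nothing to search for.

```python
# Assumed stop list that happens to cover every word of the query.
STOPWORDS = frozenset({"to", "be", "or", "not"})


def analyze(query, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in query.lower().split() if t not in stopwords]


remaining = analyze("to be or not to be")
print(remaining)  # every token was a stopword, so the list is empty
```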
>         >
>         > Best,
>         > Erick
>         >
>         > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>         > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>         > >
>         > > Hey Alex,
>         > >
>         > > Thank you!
>         > >
>         > > Re: stopwords being a thing of the past due to the
> affordability of
>         > hardware...can you expand? I'm not sure I understand.
>         > >
>         > > --
>         > > Audrey Lorberfeld
>         > > Data Scientist, w3 Search
>         > > IBM
>         > > audrey.lorberf...@ibm.com
>         > >
>         > >
>         > > On 10/8/19, 1:01 PM, "David Hastings" <
> hastings.recurs...@gmail.com>
>         > wrote:
>         > >
>         > >    Another thing to add to the above,
>         > >>
>         > >> IT:ibm. In this case, we would want to maintain the colon and
> the
>         > >> capitalization (otherwise “it” would be taken out as a
> stopword).
>         > >>
>         > >    stopwords are a thing of the past at this point.  there is
> no benefit
>         > to
>         > >    using them now with hardware being so cheap.
>         > >
>         > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
>         > arafa...@gmail.com>
>         > >    wrote:
>         > >
>         > >> If you don't want it to be touched by a tokenizer, how would
> the
>         > >> protection step know that the sequence of characters you want
> to
>         > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>         > >> protect"?
>         > >>
>         > >> It sounds to me like you may want to:
>         > >> 1) copyField to a second field
>         > >> 2) Apply a much lighter (whitespace?) tokenizer to that
> second field
>         > >> 3) Run the results through something like
> KeepWordFilterFactory
>         > >> 4) Search both fields with a boost on the second,
> higher-signal field
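Steps 2 and 3 above can be approximated in Python as a rough sketch. It is not Solr configuration: the `PROTECTED` keep-word list and the `protected_field_tokens` helper are hypothetical, and the sketch only shows what a whitespace split followed by a keep-words filter would leave in the second field.

```python
# Hypothetical keep-word list; in Solr this would be the keep-words file.
PROTECTED = frozenset({"IT:ibm"})


def protected_field_tokens(text, keep=PROTECTED):
    """Split on whitespace only (no lowercasing, no punctuation stripping),
    then keep just the tokens that appear in the keep-word list."""
    return [t for t in text.split() if t in keep]


print(protected_field_tokens("this is an IT:ibm term I want to protect"))
```

Because nothing else touches the token, the colon and the capitalization survive. A token with trailing punctuation such as `IT:ibm,` would not match in this sketch.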
>         > >>
>         > >> The other option is to run a CharFilter
>         > >> (PatternReplaceCharFilterFactory), which runs before the tokenizer, to
>         > >> map known complex acronyms to non-tokenizable substitutions, e.g.
>         > >> "IT:ibm" -> "term365". As long as it is done on both indexing and query,
> they will
>         > >> still match. You may have to have a bunch of them or write
> some sort
>         > >> of lookup map.
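The substitution idea can be sketched in Python as follows. This only imitates the behaviour and does not use Solr: `ACRONYM_MAP`, `char_filter`, and `analyze` are hypothetical names, and the regex tokenizer is a stand-in for whatever analysis chain is actually configured.

```python
import re

# Lookup map of protected acronyms; the single entry is the thread's example.
ACRONYM_MAP = {"IT:ibm": "term365"}

# Longest keys first so a longer acronym is not shadowed by a shorter prefix.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(ACRONYM_MAP, key=len, reverse=True))
)


def char_filter(text):
    """Replace known acronyms before any tokenization happens."""
    return _PATTERN.sub(lambda m: ACRONYM_MAP[m.group(0)], text)


def analyze(text, protect=True):
    """Stand-in analysis chain: optional substitution, split on
    non-alphanumerics, lowercase."""
    if protect:
        text = char_filter(text)
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


print(analyze("this is an IT:ibm term"))                 # acronym survives as one token
print(analyze("this is an IT:ibm term", protect=False))  # acronym is split apart
```

Applying the same `analyze` function to both documents and queries is what keeps the substituted tokens matching.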
>         > >>
>         > >> Regards,
>         > >>   Alex.
>         > >>
>         > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>         > >> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>         > >>>
>         > >>> Hi All,
>         > >>>
>         > >>> This is likely a rudimentary question, but I can’t seem to
> find a
>         > >> straightforward answer on forums or in the documentation… is
> there a way to
>         > >> protect tokens from ANY analysis? I know things like the
>         > >> KeywordMarkerFilterFactory protect tokens from stemming, but
> we have
>         > some
>         > >> terms we don’t even want our tokenizer to touch. Mostly,
> these are
>         > >> IBM-specific acronyms, such as IT:ibm. In this case, we would
> want to
>         > >> maintain the colon and the capitalization (otherwise “it”
> would be taken
>         > >> out as a stopword).
>         > >>>
>         > >>> Any advice is appreciated!
>         > >>>
>         > >>> Thank you,
>         > >>> Audrey
>         > >>>
>         > >>> --
>         > >>> Audrey Lorberfeld
>         > >>> Data Scientist, w3 Search
>         > >>> IBM
>         > >>> audrey.lorberf...@ibm.com
>         > >>>
>         > >>
>         > >
>         > >
>         >
>         >
>
