Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - audrey.lorberf...@ibm.com Wed, 09 Oct 2019 12:17:26 -0700

True...I guess another rub here is that we're using the edismax parser, so our
queries are OR queries by default. So for a query like 'the ibm way',
the search engine would have to:


1) retrieve a document list for:
 -->  "ibm" (this list is probably 80% of the documents)
 -->  "the" (this list is 100% of the English documents)
 -->  "way"
2) apply edismax parser
 --> foreach term
 -->  -->  foreach document  in term
 -->  -->  -->  score it

So, it seems like it would take a toll on our system.... but maybe that's 
incorrect! (For reference, our corpus is ~5MM documents, multi-language, and we 
get ~80k-100k queries/day)
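
As a rough illustration of the steps listed above, here is a toy sketch in Python. It assumes a tiny in-memory inverted index (`POSTINGS`, a made-up structure) and uses raw term frequency as a stand-in for the real similarity function. It is not how Solr/Lucene is implemented, and a real engine applies optimizations this sketch ignores. The point is only that the work grows with the total length of the posting lists touched, so a term that occurs in nearly every document dominates the cost.

```python
from collections import defaultdict

# Hypothetical toy index: term -> {doc_id: term frequency}.
# "the" appears in every document, "ibm" in most, "way" in few.
POSTINGS = {
    "the": {1: 3, 2: 1, 3: 2, 4: 5},
    "ibm": {1: 1, 2: 2, 3: 1},
    "way": {2: 1},
}


def or_query(terms, postings):
    """Score every document that contains at least one query term."""
    scores = defaultdict(float)
    visited = 0  # number of (term, document) pairs touched
    for term in terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += tf  # stand-in for a real similarity score
            visited += 1
    # Highest score first; ties broken by document id.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, visited


ranked, visited = or_query(["the", "ibm", "way"], POSTINGS)
print(ranked)   # [(4, 5.0), (1, 4.0), (2, 4.0), (3, 3.0)]
print(visited)  # 8 posting entries touched for 4 documents
```

In this toy model, dropping "the" from the query would cut the entries touched from 8 to 4, which is the intuition behind the cost concern. Whether that matters at ~5MM documents is something to measure rather than assume.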

Are you using edismax?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 3:11 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:

    If you have anything close to a decent server you won't notice it at all.
    I'm at about 21 million documents, the index varies between 450 GB and
    800 GB depending on merges, and with about 60k searches a day I stay
    sub-second non-stop, and this is on a single core/non-cloud environment.
    
    On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
    <audrey.lorberf...@ibm.com> wrote:
    
    > Also, in terms of computational cost, it would seem that including most
    > terms/not having a stop list would take a toll on the system. For instance,
    > right now we have "ibm" as a stop word because it appears everywhere in our
    > corpus. If we did not include it in the stop words file, we would have to
    > retrieve every single document in our corpus and rank them. That's a high
    > computational cost, no?
    >
    > --
    > Audrey Lorberfeld
    > Data Scientist, w3 Search
    > IBM
    > audrey.lorberf...@ibm.com
    >
    >
    > On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com"
    > <audrey.lorberf...@ibm.com> wrote:
    >
    >     Wow, thank you so much, everyone. This is all incredibly helpful
    >     insight.
    >
    >     So, would it be fair to say that the majority of you all do NOT use
    >     stop words?
    >
    >     --
    >     Audrey Lorberfeld
    >     Data Scientist, w3 Search
    >     IBM
    >     audrey.lorberf...@ibm.com
    >
    >
    >     On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com>
    >     wrote:
    >
    >         However, with all that said, stopwords CAN be useful in some
    >         situations. I combine stopwords with the shingle factory to create
    >         "interesting phrases" (not really) that I use in my "more like this"
    >         needs. For example,
    >         europe for vacation
    >         europe on vacation
    >         will create the shingle
    >         europe_vacation
    >         which I can then use to relate other documents that would be much
    >         more similar in such regard, rather than just using the
    >         "interesting words"
    >         europe, vacation
    >
    >         With the stop words left in, the shingles would be
    >         europe_for
    >         for_vacation
    >         and
    >         europe_on
    >         on_vacation
    >
    >         Just something to keep in mind: there's a lot of creative ways to use
    >         stopwords depending on your needs. I use the above for a VERY basic ML
    >         teacher and it works way better than using stopwords.
    >
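
The shingle example above can be sketched in a few lines of Python. This is a minimal sketch, not the Solr shingle filter itself: the `shingles` helper and the two-word stop list are assumptions made for illustration, and the sketch simply drops stop words before pairing neighbours. A real analysis chain may treat the removed token positions differently, so check the output of your own field type.

```python
STOP_WORDS = frozenset({"for", "on"})  # assumed stop list for this example


def shingles(text, stop_words=frozenset(), size=2):
    """Join each run of `size` adjacent tokens with '_' after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return ["_".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Stop words removed first: both phrases collapse to the same shingle.
print(shingles("europe for vacation", STOP_WORDS))  # ['europe_vacation']
print(shingles("europe on vacation", STOP_WORDS))   # ['europe_vacation']

# Stop words left in: the phrases no longer share a shingle.
print(shingles("europe for vacation"))  # ['europe_for', 'for_vacation']
print(shingles("europe on vacation"))   # ['europe_on', 'on_vacation']
```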
    >         On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson
    >         <erickerick...@gmail.com> wrote:
    >
    >         > The theory behind stopwords is that they are “safe” to remove when
    >         > calculating relevance, so we can squeeze every last bit of usefulness
    >         > out of very constrained hardware (think 64K of memory. Yes kilobytes).
    >         > We’ve come a long way since then and the necessity of removing
    >         > stopwords from the indexed tokens to conserve RAM and disk is much
    >         > less relevant than it used to be in “the bad old days” when the idea
    >         > of stopwords was invented.
    >         >
    >         > I’m not quite so confident as Alex that there is “no benefit”, but
    >         > I’ll totally agree that you should remove stopwords only _after_ you
    >         > have some evidence that removing them is A Good Thing in your
    >         > situation.
    >         >
    >         > And removing stopwords leads to some interesting corner cases.
    >         > Consider a search for “to be or not to be” if they’re all stopwords.
    >         >
    >         > Best,
    >         > Erick
    >         >
    >         > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
    >         > > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    >         > >
    >         > > Hey Alex,
    >         > >
    >         > > Thank you!
    >         > >
    >         > > Re: stopwords being a thing of the past due to the affordability
    >         > > of hardware...can you expand? I'm not sure I understand.
    >         > >
    >         > > --
    >         > > Audrey Lorberfeld
    >         > > Data Scientist, w3 Search
    >         > > IBM
    >         > > audrey.lorberf...@ibm.com
    >         > >
    >         > >
    >         > > On 10/8/19, 1:01 PM, "David Hastings"
    >         > > <hastings.recurs...@gmail.com> wrote:
    >         > >
    >         > >    Another thing to add to the above,
    >         > >>
    >         > >> IT:ibm. In this case, we would want to maintain the colon and the
    >         > >> capitalization (otherwise “it” would be taken out as a stopword).
    >         > >>
    >         > >    stopwords are a thing of the past at this point.  there is no
    >         > >    benefit to using them now with hardware being so cheap.
    >         > >
    >         > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch
    >         > >    <arafa...@gmail.com> wrote:
    >         > >
    >         > >> If you don't want it to be touched by a tokenizer, how would the
    >         > >> protection step know that the sequence of characters you want to
    >         > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
    >         > >> protect"?
    >         > >>
    >         > >> What it sounds like to me is that you may want to:
    >         > >> 1) copyField to a second field
    >         > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
    >         > >> 3) Run the results through something like KeepWordFilterFactory
    >         > >> 4) Search both fields with a boost on the second, higher-signal field
    >         > >>
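
The four steps above can be modelled outside Solr to see why the second field helps. The following Python sketch is an illustration under stated assumptions, not Solr configuration: `main_field`, `protected_field`, `score`, the stop list, the keep list and the boost value are all made up here, and matching is reduced to counting shared tokens.

```python
import re

STOP_WORDS = {"it", "the", "is", "an"}  # assumed stop list
KEEP_WORDS = {"IT:ibm"}                 # hypothetical list of protected terms


def main_field(text):
    """Heavier analysis: split on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def protected_field(text):
    """Lighter analysis: whitespace split, keep only listed terms, case intact."""
    return [t for t in text.split() if t in KEEP_WORDS]


def score(query, doc, boost=2.0):
    """Count shared tokens per field; the protected field gets a boost."""
    main = len(set(main_field(query)) & set(main_field(doc)))
    protected = len(set(protected_field(query)) & set(protected_field(doc)))
    return main + boost * protected


print(main_field("IT:ibm"))       # ['ibm']  ("it" is lost as a stop word)
print(protected_field("IT:ibm"))  # ['IT:ibm']
print(score("IT:ibm", "this is an IT:ibm term"))  # 3.0
print(score("IT:ibm", "ibm it department"))       # 1.0
```

The document that contains the exact acronym outranks the one that merely mentions "ibm", because only the former matches in the lightly analyzed field.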
    >         > >> The other option is to run a CharFilter
    >         > >> (PatternReplaceCharFilterFactory), which runs before the tokenizer,
    >         > >> to map known complex acronyms to non-tokenizable substitutions.
    >         > >> E.g. "IT:ibm" -> "term365". As long as it is done on both indexing
    >         > >> and query, they will still match. You may have to have a bunch of
    >         > >> them or write some sort of lookup map.
    >         > >>
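
The substitution idea above can be tried out in plain Python before touching any schema. This is a minimal sketch that only imitates a pre-tokenizer character filter: `char_filter`, `analyze`, the stop list and the tokenizer regex are assumptions for illustration, and the `IT:ibm` to `term365` mapping is the example from the message. The same `analyze` function is applied to documents and queries, which is what keeps them matching.

```python
import re

STOP_WORDS = {"it", "is", "an"}         # assumed stop list
SUBSTITUTIONS = {"IT:ibm": "term365"}   # lookup map; entry taken from the thread

# One case-sensitive pattern that matches any protected acronym literally.
_PATTERN = re.compile("|".join(re.escape(k) for k in SUBSTITUTIONS))


def char_filter(text):
    """Replace protected acronyms before any tokenization happens."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0)], text)


def analyze(text, protect=True):
    """Optional substitution, then lowercase, split on non-alphanumerics, drop stop words."""
    if protect:
        text = char_filter(text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(analyze("IT:ibm", protect=False))    # ['ibm']  (colon split, "it" dropped)
print(analyze("IT:ibm"))                   # ['term365']
print(analyze("this is an IT:ibm term"))   # ['this', 'term365', 'term']
```

Because the pattern is case-sensitive, a lowercase "it:ibm" in a query would not be substituted in this sketch. Whether that is desirable depends on how users actually type these acronyms.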
    >         > >> Regards,
    >         > >>   Alex.
    >         > >>
    >         > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
    >         > >> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
    >         > >>>
    >         > >>> Hi All,
    >         > >>>
    >         > >>> This is likely a rudimentary question, but I can’t seem to find a
    >         > >>> straight-forward answer on forums or the documentation…is there a
    >         > >>> way to protect tokens from ANY analysis? I know things like the
    >         > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we
    >         > >>> have some terms we don’t even want our tokenizer to touch. Mostly,
    >         > >>> these are IBM-specific acronyms, such as IT:ibm. In this case, we
    >         > >>> would want to maintain the colon and the capitalization (otherwise
    >         > >>> “it” would be taken out as a stopword).
    >         > >>>
    >         > >>> Any advice is appreciated!
    >         > >>>
    >         > >>> Thank you,
    >         > >>> Audrey
    >         > >>>
    >         > >>> --
    >         > >>> Audrey Lorberfeld
    >         > >>> Data Scientist, w3 Search
    >         > >>> IBM
    >         > >>> audrey.lorberf...@ibm.com
    >         > >>>
    >         > >>
    >         > >
    >         > >
    >         >
    >         >
    >
