Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

David Hastings Wed, 09 Oct 2019 12:35:14 -0700

yup.  you're going to find Solr is WAY more efficient than you think when it
comes to complex queries.


On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
<audrey.lorberf...@ibm.com> wrote:

> True...I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherently OR queries. So for a query like  'the ibm
> way', the search engine would have to:
>
> 1) retrieve a document list for:
>  -->  "ibm" (this list is probably 80% of the documents)
>  -->  "the" (this list is 100% of the English documents)
>  -->  "way"
> 2) apply edismax parser
>  --> foreach term
>  -->  -->  foreach document in term
>  -->  -->  -->  score it
>
> So, it seems like it would take a toll on our system.... but maybe that's
> incorrect! (For reference, our corpus is ~5MM documents, multi-language,
> and we get ~80k-100k queries/day)
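A toy Python sketch of the term-at-a-time loop described above may help put a number on the cost. The index contents, the `or_query` name, and the raw term-frequency score are all made up for illustration; this is not how edismax or Lucene actually scores, only the naive shape of the work.

```python
from collections import defaultdict

# Toy inverted index: term -> {doc_id: term frequency}. Illustrative data only.
INDEX = {
    "the": {1: 3, 2: 1, 3: 2},
    "ibm": {1: 1, 3: 4},
    "way": {2: 1},
}

def or_query(terms, index):
    """Term-at-a-time OR scoring: walk each term's posting list exactly once."""
    scores = defaultdict(float)
    for term in terms:
        postings = index.get(term, {})
        for doc_id, tf in postings.items():
            # Raw term frequency stands in for a real similarity function.
            scores[doc_id] += tf
    # Highest score first; ties broken by doc id so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranked = or_query(["the", "ibm", "way"], INDEX)
```

Even in this naive form, the work is the sum of the posting-list lengths for the query terms, one pass per list, so a very common term costs one walk over its list rather than a re-scan of the whole corpus per term.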
>
> Are you using edismax?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 3:11 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
>
>     if you have anything close to a decent server you won't notice it at all.
>     i'm at about 21 million documents, index varies between 450gb and 800gb
>     depending on merges, and about 60k searches a day, and stay sub-second
>     non-stop, and this is on a single core/non-cloud environment
>
>     On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
>     audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
>     > Also, in terms of computational cost, it would seem that including most
>     > terms/not having a stop list would take a toll on the system. For instance,
>     > right now we have "ibm" as a stop word because it appears everywhere in our
>     > corpus. If we did not include it in the stop words file, we would have to
>     > retrieve every single document in our corpus and rank them. That's a high
>     > computational cost, no?
>     >
>     > --
>     > Audrey Lorberfeld
>     > Data Scientist, w3 Search
>     > IBM
>     > audrey.lorberf...@ibm.com
>     >
>     >
>     > On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com"
>     > <audrey.lorberf...@ibm.com> wrote:
>     >
>     >     Wow, thank you so much, everyone. This is all incredibly helpful
>     >     insight.
>     >
>     >     So, would it be fair to say that the majority of you all do NOT use
>     >     stop words?
>     >
>     >     --
>     >     Audrey Lorberfeld
>     >     Data Scientist, w3 Search
>     >     IBM
>     >     audrey.lorberf...@ibm.com
>     >
>     >
>     >     On 10/9/19, 11:14 AM, "David Hastings"
>     >     <hastings.recurs...@gmail.com> wrote:
>     >
>     >         However, with all that said, stopwords CAN be useful in some
>     >         situations.  I combine stopwords with the shingle factory to create
>     >         "interesting phrases" (not really) that i use in my "more like this"
>     >         needs.  for example,
>     >         europe for vacation
>     >         europe on vacation
>     >         will create the shingle
>     >         europe_vacation
>     >         which i can then use to relate other documents that would be much
>     >         more similar in such regard, rather than just using the
>     >         "interesting words"
>     >         europe, vacation
>     >
>     >         with the stop words left in, the shingles would be
>     >         europe_for
>     >         for_vacation
>     >         and
>     >         europe_on
>     >         on_vacation
>     >
>     >         just something to keep in mind,  there's a lot of creative ways to use
>     >         stopwords depending on your needs.  i use the above for a VERY basic ML
>     >         teacher and it works way better than using stopwords,
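The shingle behaviour described above can be mimicked in a few lines of Python. This is only a sketch: the `shingles` helper and the two-word stop list are hypothetical, and Solr's own shingle filter has more options (shingle sizes, separator, how gaps left by removed tokens are handled) that are ignored here.

```python
STOPWORDS = {"for", "on"}  # hypothetical stop list, just for this example

def shingles(text, stopwords=frozenset(), sep="_"):
    """Return two-word shingles from whitespace tokens, after dropping stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    # Pair each token with its right-hand neighbour.
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]

with_removal = shingles("europe for vacation", STOPWORDS)
without_removal = shingles("europe for vacation")
```

With the stop list applied, both "europe for vacation" and "europe on vacation" collapse to the single shingle `europe_vacation`; with nothing removed, each phrase yields its own pair of shingles.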
>     >
>     >         On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson
>     >         <erickerick...@gmail.com> wrote:
>     >
>     >         > The theory behind stopwords is that they are “safe” to remove when
>     >         > calculating relevance, so we can squeeze every last bit of usefulness out
>     >         > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
>     >         > come a long way since then and the necessity of removing stopwords from the
>     >         > indexed tokens to conserve RAM and disk is much less relevant than it used
>     >         > to be in “the bad old days” when the idea of stopwords was invented.
>     >         >
>     >         > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
>     >         > totally agree that you should remove stopwords only _after_ you have some
>     >         > evidence that removing them is A Good Thing in your situation.
>     >         >
>     >         > And removing stopwords leads to some interesting corner cases. Consider a
>     >         > search for “to be or not to be” if they’re all stopwords.
>     >         >
>     >         > Best,
>     >         > Erick
>     >         >
>     >         > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>     >         > > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>     >         > >
>     >         > > Hey Alex,
>     >         > >
>     >         > > Thank you!
>     >         > >
>     >         > > Re: stopwords being a thing of the past due to the affordability of
>     >         > > hardware...can you expand? I'm not sure I understand.
>     >         > >
>     >         > > --
>     >         > > Audrey Lorberfeld
>     >         > > Data Scientist, w3 Search
>     >         > > IBM
>     >         > > audrey.lorberf...@ibm.com
>     >         > >
>     >         > >
>     >         > > On 10/8/19, 1:01 PM, "David Hastings"
>     >         > > <hastings.recurs...@gmail.com> wrote:
>     >         > >
>     >         > >    Another thing to add to the above,
>     >         > >>
>     >         > >> IT:ibm. In this case, we would want to maintain the colon and the
>     >         > >> capitalization (otherwise “it” would be taken out as a stopword).
>     >         > >>
>     >         > >    stopwords are a thing of the past at this point.  there is no benefit to
>     >         > >    using them now with hardware being so cheap.
>     >         > >
>     >         > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch
>     >         > >    <arafa...@gmail.com> wrote:
>     >         > >
>     >         > >> If you don't want it to be touched by a tokenizer, how would the
>     >         > >> protection step know that the sequence of characters you want to
>     >         > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>     >         > >> protect"?
>     >         > >>
>     >         > >> It sounds to me like you may want to:
>     >         > >> 1) copyField to a second field
>     >         > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>     >         > >> 3) Run the results through something like KeepWordFilterFactory
>     >         > >> 4) Search both fields with a boost on the second, higher-signal field
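The four steps above can be pictured with a small Python simulation. Nothing here is Solr code or Solr configuration: the function names, the keep list, the stop list, and the boost value of 5.0 are assumptions chosen purely to show the idea of a heavily analyzed main field next to a lightly analyzed, keep-word-filtered copy.

```python
import re

KEEP_WORDS = {"IT:ibm"}          # hypothetical list of protected terms
STOPWORDS = {"it", "the", "is"}  # hypothetical stop list for the main field

def main_field_tokens(text):
    """Heavier analysis: split on non-alphanumerics, lowercase, drop stopwords."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in STOPWORDS]

def protected_field_tokens(text):
    """Lighter analysis: whitespace split, keep only listed terms, case intact."""
    return [t for t in text.split() if t in KEEP_WORDS]

def score(query, doc, boost=5.0):
    """Match both fields; hits on the protected field count `boost` times more."""
    main_hits = set(main_field_tokens(query)) & set(main_field_tokens(doc))
    protected_hits = set(protected_field_tokens(query)) & set(protected_field_tokens(doc))
    return len(main_hits) + boost * len(protected_hits)

doc = "this is an IT:ibm term I want to protect"
exact = score("IT:ibm", doc)   # matches "ibm" in the main field and "IT:ibm" in the copy
loose = score("it ibm", doc)   # matches only "ibm" in the main field
```

In this toy version the main field still mangles "IT:ibm" into "ibm", but the second field keeps the exact form, so a query that types the acronym precisely outranks one that does not.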
>     >         > >>
>     >         > >> The other option is to run a CharFilter
>     >         > >> (PatternReplaceCharFilterFactory), which runs before the tokenizer, to map
>     >         > >> known complex acronyms to non-tokenizable substitutions, e.g.
>     >         > >> "IT:ibm" -> "term365". As long as it is done on both indexing and query,
>     >         > >> they will still match. You may have to have a bunch of them or write some
>     >         > >> sort of lookup map.
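A minimal Python sketch of the character-level substitution idea, assuming a small lookup map. The `PROTECTED` map, the stop list, and the `analyze` chain are hypothetical stand-ins for a real analyzer; the point is only that the rewrite happens before tokenization and is applied identically to indexed text and to queries.

```python
import re

# Hypothetical lookup map: protected acronym -> opaque placeholder token.
PROTECTED = {"IT:ibm": "term365"}
STOPWORDS = {"it", "the", "is"}  # hypothetical stop list

# One alternation over all protected sequences, escaped so ":" etc. are literal.
_pattern = re.compile("|".join(re.escape(k) for k in PROTECTED))

def char_filter(text):
    """Rewrite protected sequences before any tokenization happens."""
    return _pattern.sub(lambda m: PROTECTED[m.group(0)], text)

def analyze(text):
    """Same chain for index and query: char filter, tokenize, lowercase, stop."""
    tokens = re.split(r"[^A-Za-z0-9]+", char_filter(text))
    return [t.lower() for t in tokens if t and t.lower() not in STOPWORDS]

indexed = analyze("Ask the IT:ibm team")
query = analyze("IT:ibm")
```

Because the placeholder contains no punctuation and is not a stop word, it survives the rest of the chain as a single token on both sides, which is what lets the query match the indexed document.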
>     >         > >>
>     >         > >> Regards,
>     >         > >>   Alex.
>     >         > >>
>     >         > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>     >         > >> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>     >         > >>>
>     >         > >>> Hi All,
>     >         > >>>
>     >         > >>> This is likely a rudimentary question, but I can’t seem to find a
>     >         > >>> straightforward answer on forums or the documentation…is there a way to
>     >         > >>> protect tokens from ANY analysis? I know things like the
>     >         > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>     >         > >>> terms we don’t even want our tokenizer to touch. Mostly, these are
>     >         > >>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>     >         > >>> maintain the colon and the capitalization (otherwise “it” would be taken
>     >         > >>> out as a stopword).
>     >         > >>>
>     >         > >>> Any advice is appreciated!
>     >         > >>>
>     >         > >>> Thank you,
>     >         > >>> Audrey
>     >         > >>>
>     >         > >>> --
>     >         > >>> Audrey Lorberfeld
>     >         > >>> Data Scientist, w3 Search
>     >         > >>> IBM
>     >         > >>> audrey.lorberf...@ibm.com
>     >         > >>>
>     >         > >>
>     >         > >
>     >         > >
>     >         >
>     >         >
