True...I guess another rub here is that we're using the edismax parser, so all of our queries are inherently OR queries. So for a query like 'the ibm way', the search engine would have to:

1) retrieve a document list for:
   --> "ibm" (this list is probably 80% of the documents)
   --> "the" (this list is 100% of the English documents)
   --> "way"
2) apply edismax parser
   --> foreach term
   --> --> foreach document in term
   --> --> --> score it

So, it seems like it would take a toll on our system... but maybe that's incorrect! (For reference, our corpus is ~5MM documents, multi-language, and we get ~80k-100k queries/day.)

Are you using edismax?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com

On 10/9/19, 3:11 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:

If you have anything close to a decent server you won't notice it at all. I'm at about 21 million documents, index varies between 450 GB and 800 GB depending on merges, and about 60k searches a day, and I stay sub-second non-stop, and this is on a single core/non-cloud environment.

On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop list would take a toll on the system. For instance,
> right now we have "ibm" as a stop word because it appears everywhere in our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <audrey.lorberf...@ibm.com> wrote:
>
> Wow, thank you so much, everyone. This is all incredibly helpful
> insight.
>
> So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
> On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
>
> However, with all that said, stopwords CAN be useful in some situations. I combine stopwords with the shingle factory to create "interesting phrases" (not really) that I use in my "more like this" needs. For example,
>
>     europe for vacation
>     europe on vacation
>
> will create the shingle
>
>     europe_vacation
>
> which I can then use to relate other documents that would be much more similar in such regard, rather than just using the "interesting words"
>
>     europe, vacation
>
> With the stop words left in, the shingles would be
>
>     europe_for
>     for_vacation
>
> and
>
>     europe_on
>     on_vacation
>
> Just something to keep in mind: there's a lot of creative ways to use stopwords depending on your needs. I use the above for a VERY basic ML teacher and it works way better than using stopwords.
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove when calculating relevance, so we can squeeze every last bit of usefulness out of very constrained hardware (think 64K of memory. Yes, kilobytes). We’ve come a long way since then, and the necessity of removing stopwords from the indexed tokens to conserve RAM and disk is much less relevant than it used to be in “the bad old days” when the idea of stopwords was invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”, but I’ll totally agree that you should remove stopwords only _after_ you have some evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases. Consider a search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> > >
> > > Hey Alex,
> > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the affordability of hardware...can you expand? I'm not sure I understand.
> > >
> > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
> > >
> > > Another thing to add to the above,
> > >
> > >> IT:ibm. In this case, we would want to maintain the colon and the
> > >> capitalization (otherwise “it” would be taken out as a stopword).
> > >
> > > Stopwords are a thing of the past at this point. There is no benefit to using them now with hardware being so cheap.
> > >
> > > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >
> > >> If you don't want it to be touched by a tokenizer, how would the protection step know that the sequence of characters you want to protect is "IT:ibm" and not "this is an IT:ibm term I want to protect"?
> > >>
> > >> What it sounds like to me is that you may want to:
> > >> 1) copyField to a second field
> > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> > >> 3) Run the results through something like KeepWordFilterFactory
> > >> 4) Search both fields with a boost on the second, higher-signal field
> > >>
> > >> The other option is to run a CharFilter (PatternReplaceCharFilterFactory), which runs before the tokenizer, to map known complex acronyms to non-tokenizable substitutions, e.g. "IT:ibm -> term365". As long as it is done on both indexing and query, they will still match. You may have to have a bunch of them or write some sort of lookup map.
> > >>
> > >> Regards,
> > >> Alex.
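
The character-filter option Alex describes can be illustrated outside Solr. What follows is a minimal Python sketch, not Solr configuration: the `PROTECTED` map, the toy `STOPWORDS` set, and the `char_filter` and `analyze` helpers are all hypothetical stand-ins for a real analysis chain, and `term365` is simply the placeholder from Alex's example. It only shows the idea that swapping a protected acronym for a plain placeholder before tokenization, on both the index and the query side, lets it survive splitting, lowercasing and stopword removal.

```python
import re
from typing import List

# Hypothetical lookup map: protected acronym -> placeholder without punctuation.
# "term365" is the placeholder from Alex's example; a real map would be longer.
PROTECTED = {"IT:ibm": "term365"}

# Toy stopword list, for illustration only.
STOPWORDS = {"it", "the", "is", "an", "a"}

# One regex matching any protected term, longest key first.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)


def char_filter(text: str) -> str:
    """Pre-tokenizer step: swap protected acronyms for their placeholders."""
    return _PATTERN.sub(lambda m: PROTECTED[m.group(0)], text)


def analyze(text: str, protect: bool = True) -> List[str]:
    """Toy chain: optional char filter, split on non-alphanumerics,
    lowercase, drop stopwords."""
    if protect:
        text = char_filter(text)
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in STOPWORDS]


doc = "this is an IT:ibm term I want to protect"
print(analyze(doc, protect=False))  # acronym is split and "it" is dropped
print(analyze(doc, protect=True))   # acronym survives as its placeholder
print(analyze("IT:ibm"))            # the query side goes through the same map
```

Because the same substitution runs on the document and on the query, both end up containing the placeholder token and still match, which is the point Alex makes about applying the filter at index and query time.
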
> > >>
> > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> > >>>
> > >>> Hi All,
> > >>>
> > >>> This is likely a rudimentary question, but I can’t seem to find a straightforward answer on forums or the documentation…is there a way to protect tokens from ANY analysis? I know things like the KeywordMarkerFilterFactory protect tokens from stemming, but we have some terms we don’t even want our tokenizer to touch. Mostly, these are IBM-specific acronyms, such as IT:ibm. In this case, we would want to maintain the colon and the capitalization (otherwise “it” would be taken out as a stopword).
> > >>>
> > >>> Any advice is appreciated!
> > >>>
> > >>> Thank you,
> > >>> Audrey
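
David's shingle example from earlier in the thread can be reproduced in a few lines of Python. This is a sketch of the idea only and does not model how Solr's stop and shingle filters handle position gaps: the `shingles` helper, the two-word shingle size, the underscore separator and the toy stopword set are assumptions chosen to match his example.

```python
from typing import Iterable, List

# Toy stopword list covering only the words in David's example.
STOPWORDS = {"for", "on"}


def shingles(text: str, stopwords: Iterable[str] = ()) -> List[str]:
    """Return two-word shingles joined by "_", after dropping any stopwords."""
    stop = set(stopwords)
    tokens = [t for t in text.lower().split() if t not in stop]
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


for phrase in ("europe for vacation", "europe on vacation"):
    # First list: stopwords removed before shingling; second: stopwords kept.
    print(phrase, "->", shingles(phrase, STOPWORDS), "vs", shingles(phrase))
```

With the stopwords removed, both phrases collapse to the same shingle, which is what lets the two documents be related to each other; with the stopwords kept, the two phrases share no shingle at all.
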