Yup. You're going to find Solr is WAY more efficient than you think when it comes to complex queries.

On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:

> True...I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherently OR queries. So for a query like 'the ibm
> way', the search engine would have to:
>
> 1) retrieve a document list for:
> --> "ibm" (this list is probably 80% of the documents)
> --> "the" (this list is 100% of the English documents)
> --> "way"
> 2) apply edismax parser
> --> foreach term
> --> --> foreach document in term
> --> --> --> score it
>
> So, it seems like it would take a toll on our system.... but maybe that's
> incorrect! (For reference, our corpus is ~5MM documents, multi-language,
> and we get ~80k-100k queries/day)
>
> Are you using edismax?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 3:11 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
>
>   if you have anything close to a decent server you won't notice it at all.
>   I'm at about 21 million documents, index varies between 450gb to 800gb
>   depending on merges, and about 60k searches a day and stay sub second non
>   stop, and this is on a single core/non cloud environment
>
>   On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
>   <audrey.lorberf...@ibm.com> wrote:
>
>   > Also, in terms of computational cost, it would seem that including most
>   > terms/not having a stop list would take a toll on the system. For instance,
>   > right now we have "ibm" as a stop word because it appears everywhere in our
>   > corpus. If we did not include it in the stop words file, we would have to
>   > retrieve every single document in our corpus and rank them. That's a high
>   > computational cost, no?
>   >
>   > --
>   > Audrey Lorberfeld
>   > Data Scientist, w3 Search
>   > IBM
>   > audrey.lorberf...@ibm.com
>   >
>   >
>   > On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com"
>   > <audrey.lorberf...@ibm.com> wrote:
>   >
>   >   Wow, thank you so much, everyone. This is all incredibly helpful insight.
>   >
>   >   So, would it be fair to say that the majority of you all do NOT use
>   >   stop words?
>   >
>   >   --
>   >   Audrey Lorberfeld
>   >   Data Scientist, w3 Search
>   >   IBM
>   >   audrey.lorberf...@ibm.com
>   >
>   >
>   >   On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
>   >
>   >     However, with all that said, stopwords CAN be useful in some situations. I
>   >     combine stopwords with the shingle factory to create "interesting phrases"
>   >     (not really) that i use in "my more like this" needs. for example,
>   >     europe for vacation
>   >     europe on vacation
>   >     will create the shingle
>   >     europe_vacation
>   >     which i can then use to relate other documents that would be much
>   >     more similar in such regard, rather than just using the "interesting words"
>   >     europe, vacation
>   >
>   >     with stop words, the shingles would be
>   >     europe_for
>   >     for_vacation
>   >     and
>   >     europe_on
>   >     on_vacation
>   >
>   >     just something to keep in mind, there's a lot of creative ways to use
>   >     stopwords depending on your needs. i use the above for a VERY basic ML
>   >     teacher and it works way better than using stopwords,
>   >
>   >     On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerick...@gmail.com> wrote:
>   >
>   >     > The theory behind stopwords is that they are “safe” to remove when
>   >     > calculating relevance, so we can squeeze every last bit of usefulness out
>   >     > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
>   >     > come a long way since then and the necessity of removing stopwords from the
>   >     > indexed tokens to conserve RAM and disk is much less relevant than it used
>   >     > to be in “the bad old days” when the idea of stopwords was invented.
>   >     >
>   >     > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
>   >     > totally agree that you should remove stopwords only _after_ you have some
>   >     > evidence that removing them is A Good Thing in your situation.
>   >     >
>   >     > And removing stopwords leads to some interesting corner cases. Consider a
>   >     > search for “to be or not to be” if they’re all stopwords.
>   >     >
>   >     > Best,
>   >     > Erick
>   >     >
>   >     > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>   >     > > audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>   >     > >
>   >     > > Hey Alex,
>   >     > >
>   >     > > Thank you!
>   >     > >
>   >     > > Re: stopwords being a thing of the past due to the affordability of
>   >     > > hardware...can you expand? I'm not sure I understand.
>   >     > >
>   >     > > --
>   >     > > Audrey Lorberfeld
>   >     > > Data Scientist, w3 Search
>   >     > > IBM
>   >     > > audrey.lorberf...@ibm.com
>   >     > >
>   >     > >
>   >     > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
>   >     > >
>   >     > >   Another thing to add to the above,
>   >     > >   >
>   >     > >   > IT:ibm. In this case, we would want to maintain the colon and the
>   >     > >   > capitalization (otherwise “it” would be taken out as a stopword).
>   >     > >   >
>   >     > >   stopwords are a thing of the past at this point. there is no benefit to
>   >     > >   using them now with hardware being so cheap.
>   >     > >
>   >     > > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com>
>   >     > > wrote:
>   >     > >
>   >     > >> If you don't want it to be touched by a tokenizer, how would the
>   >     > >> protection step know that the sequence of characters you want to
>   >     > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>   >     > >> protect"?
>   >     > >>
>   >     > >> What it sounds to me is that you may want to:
>   >     > >> 1) copyField to a second field
>   >     > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>   >     > >> 3) Run the results through something like KeepWordFilterFactory
>   >     > >> 4) Search both fields with a boost on the second, higher-signal field
>   >     > >>
>   >     > >> The other option is to run CharacterFilter,
>   >     > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
>   >     > >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
>   >     > >> term365". As long as it is done on both indexing and query, they will
>   >     > >> still match. You may have to have a bunch of them or write some sort
>   >     > >> of lookup map.
>   >     > >>
>   >     > >> Regards,
>   >     > >> Alex.
>   >     > >>
>   >     > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>   >     > >> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>   >     > >>>
>   >     > >>> Hi All,
>   >     > >>>
>   >     > >>> This is likely a rudimentary question, but I can’t seem to find a
>   >     > >>> straight-forward answer on forums or the documentation…is there a way to
>   >     > >>> protect tokens from ANY analysis? I know things like the
>   >     > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>   >     > >>> terms we don’t even want our tokenizer to touch. Mostly, these are
>   >     > >>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>   >     > >>> maintain the colon and the capitalization (otherwise “it” would be taken
>   >     > >>> out as a stopword).
>   >     > >>>
>   >     > >>> Any advice is appreciated!
>   >     > >>>
>   >     > >>> Thank you,
>   >     > >>> Audrey
>   >     > >>>
>   >     > >>> --
>   >     > >>> Audrey Lorberfeld
>   >     > >>> Data Scientist, w3 Search
>   >     > >>> IBM
>   >     > >>> audrey.lorberf...@ibm.com
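
P.S. For anyone skimming the thread, here is a rough Python sketch of two ideas discussed above: mapping a protected acronym such as "IT:ibm" to an opaque placeholder before tokenizing (the effect Alex describes for a pre-tokenizer char filter), and dropping stopwords before building shingles so that "europe for vacation" and "europe on vacation" both yield europe_vacation. This is a toy simulation only, not Solr code or Solr's actual analysis chain. The names PROTECTED, STOPWORDS, protect, analyze and shingles are made up for illustration, the stopword list is an assumption, and the match is case-sensitive, so the same mapping would have to run at both index and query time.

```python
import re

# Hypothetical lookup map: protected acronym -> opaque placeholder token.
# "term365" is the example substitution mentioned in the thread.
PROTECTED = {"IT:ibm": "term365"}

# Assumed toy stopword list, for illustration only.
STOPWORDS = {"for", "on", "the", "it"}


def protect(text, mapping=PROTECTED):
    """Swap each whitespace-delimited protected term for its placeholder.

    This runs before tokenizing, so the tokenizer never sees the colon.
    """
    for term, placeholder in mapping.items():
        pattern = r"(?<!\S)" + re.escape(term) + r"(?!\S)"
        text = re.sub(pattern, placeholder, text)
    return text


def analyze(text):
    """Toy chain: protect, lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", protect(text).lower())
    return [t for t in tokens if t not in STOPWORDS]


def shingles(tokens, size=2):
    """Join each run of `size` adjacent tokens with an underscore."""
    return ["_".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    print(analyze("the IT:ibm way"))
    print(shingles(analyze("europe for vacation")))
    print(shingles(analyze("europe on vacation")))
```

Without the protect step, the same toy chain would split "IT:ibm" into "it" and "ibm" and then drop "it" as a stopword, which is the failure described in the original question.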