Only in my "more like this" tools, but they have a very specific purpose; otherwise, no.
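For anyone skimming the thread, the shingle trick described in the quoted message below can be imitated in a few lines of plain Python. This is only a rough sketch, not Solr code: the `STOPWORDS` set and the `shingles`/`analyze` helpers are made up for illustration, and the output of Solr's real shingle filter depends on its separator and filler-token settings, which the sketch ignores.

```python
# Illustration only: imitate "drop stopwords, then build two-word shingles".
# STOPWORDS and both helper names are hypothetical, not part of Solr.
STOPWORDS = {"for", "on"}


def shingles(tokens, size=2, sep="_"):
    """Join every run of `size` adjacent tokens with `sep`."""
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def analyze(text, remove_stopwords):
    """Lowercase, split on whitespace, optionally drop stopwords, then shingle."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return shingles(tokens)


# Stopwords removed first: both phrases collapse to the same shingle.
print(analyze("europe for vacation", remove_stopwords=True))
print(analyze("europe on vacation", remove_stopwords=True))

# Stopwords kept: the shingles differ between the two phrases.
print(analyze("europe for vacation", remove_stopwords=False))
print(analyze("europe on vacation", remove_stopwords=False))
```

With the stopwords removed, the two phrases share a shingle and so look similar to a "more like this" comparison; with them kept, they share none.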
On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:

> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority of you all do NOT use stop
> words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
>
> However, with all that said, stopwords CAN be useful in some situations. I
> combine stopwords with the shingle factory to create "interesting phrases"
> (not really) that i use in "my more like this" needs. for example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the "interesting words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind, theres a lot of creative ways to use
> stopwords depending on your needs. i use the above for a VERY basic ML
> teacher and it works way better than using stopwords,
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove when
> > calculating relevance, so we can squeeze every last bit of usefulness out
> > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
> > come a long way since then and the necessity of removing stopwords from the
> > indexed tokens to conserve RAM and disk is much less relevant than it used
> > to be in “the bad old days” when the idea of stopwords was invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> > totally agree that you should remove stopwords only _after_ you have some
> > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases. Consider a
> > search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> > >
> > > Hey Alex,
> > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the affordability of
> > > hardware...can you expand? I'm not sure I understand.
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recurs...@gmail.com> wrote:
> > >
> > > Another thing to add to the above,
> > >>
> > >> IT:ibm. In this case, we would want to maintain the colon and the
> > >> capitalization (otherwise “it” would be taken out as a stopword).
> > >>
> > > stopwords are a thing of the past at this point. there is no benefit to
> > > using them now with hardware being so cheap.
> > >
> > > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >
> > >> If you don't want it to be touched by a tokenizer, how would the
> > >> protection step know that the sequence of characters you want to
> > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > >> protect"?
> > >>
> > >> What it sounds to me is that you may want to:
> > >> 1) copyField to a second field
> > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> > >> 3) Run the results through something like KeepWordFilterFactory
> > >> 4) Search both fields with a boost on the second, higher-signal field
> > >>
> > >> The other option is to run CharacterFilter,
> > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> > >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > >> term365". As long as it is done on both indexing and query, they will
> > >> still match. You may have to have a bunch of them or write some sort
> > >> of lookup map.
> > >>
> > >> Regards,
> > >> Alex.
> > >>
> > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> > >>>
> > >>> Hi All,
> > >>>
> > >>> This is likely a rudimentary question, but I can’t seem to find a
> > >>> straight-forward answer on forums or the documentation…is there a way to
> > >>> protect tokens from ANY analysis? I know things like the
> > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> > >>> terms we don’t even want our tokenizer to touch. Mostly, these are
> > >>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> > >>> maintain the colon and the capitalization (otherwise “it” would be taken
> > >>> out as a stopword).
> > >>>
> > >>> Any advice is appreciated!
> > >>>
> > >>> Thank you,
> > >>> Audrey
> > >>>
> > >>> --
> > >>> Audrey Lorberfeld
> > >>> Data Scientist, w3 Search
> > >>> IBM
> > >>> audrey.lorberf...@ibm.com
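
As a rough illustration of the pre-tokenizer substitution idea at the bottom of the thread, here is a small Python sketch. It is not Solr configuration and makes no claim about how PatternReplaceCharFilterFactory is implemented: the `PROTECTED` lookup map, the `STOPWORDS` set, the toy tokenizer, and the function names are all assumptions made up for this example. Only the "IT:ibm" to "term365" pair comes from the thread.

```python
import re

# Hypothetical lookup map of protected terms to placeholder tokens.
# "IT:ibm" -> "term365" is the pair used as an example in the thread.
PROTECTED = {"IT:ibm": "term365"}

# Hypothetical stopword list for this sketch only.
STOPWORDS = {"it", "is", "an", "this", "i", "to"}

# Match any protected term, longest first, so overlapping keys behave sensibly.
_PROTECTED_RE = re.compile(
    "|".join(re.escape(term) for term in sorted(PROTECTED, key=len, reverse=True))
)


def protect(text):
    """Replace each protected term with its placeholder before tokenizing."""
    return _PROTECTED_RE.sub(lambda m: PROTECTED[m.group(0)], text)


def analyze(text, use_protection=True):
    """Toy analysis chain: optional substitution, lowercase, split on
    non-word characters, then drop stopwords."""
    if use_protection:
        text = protect(text)
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# The same function is applied to both the document and the query,
# so the placeholder matches on both sides.
print(analyze("this is an IT:ibm term I want to protect"))
print(analyze("IT:ibm"))

# Without the substitution step, the colon splits the term and "it" is dropped.
print(analyze("IT:ibm", use_protection=False))
```

The substitution has to run identically at index time and at query time, as the thread notes; otherwise the placeholder never appears on one side and nothing matches.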