Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

David Hastings Fri, 08 Nov 2019 10:17:52 -0800

I use 3-word shingles with stopwords for my MLT ML trainer. That worked
pretty well for such a solution, but for a full index the size became
prohibitive.
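
For readers unfamiliar with the term: a 3-word shingle is every run of three
consecutive tokens, with stopwords kept. Below is a minimal Python sketch of
the idea, not the Solr implementation (in Solr this would normally be done by
a shingle filter in the analysis chain). The function name and the sample
text are illustrative assumptions.

```python
def word_shingles(text, size=3):
    """Return every run of `size` consecutive lowercased tokens, stopwords kept."""
    # Whitespace tokenization is an assumption; a real analyzer does more.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


shingles = word_shingles("Lymphoid and a non-Lymphoid cell")
print(shingles)
```

Each shingle is indexed as a single term, so the number of distinct terms
grows much faster than with single words, which is consistent with the
index-size problem mentioned above.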


On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> If we had IDF for phrases, they would be super effective. The 2X weight is
> a hack that mostly works.
>
> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 8, 2019, at 11:08 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > the pf and qf fields are REALLY nice for this
> >
> > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <wun...@wunderwood.org>
> > wrote:
> >
> >> I always enable phrase searching in edismax for exactly this reason.
> >>
> >> Something like:
> >>
> >>       <str name="qf">title^8 keywords^4 text</str>
> >>       <str name="pf">title^16 keywords^8 text^2</str>
> >>
> >> To deal with concepts in queries, a classifier and/or named entity
> >> extractor can be helpful. If you have a list of concepts (“controlled
> >> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
> >> term can be queried against the field matching that vocabulary.
> >>
> >> This is how LinkedIn separates people, companies, and places, for example.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >>>
> >>> Look at the “mm” parameter, try setting it to 100%. Although that’s not
> >>> entirely likely to do what you want either, since virtually every doc
> >>> will have “a” in it. But at least you’d get docs that have both terms.
> >>>
> >>> You may also be able to search for things like “Lamin A” _only as a
> >>> phrase_ and have some luck. But this is a gnarly problem in general.
> >>> Some people have been able to substitute synonyms and/or shingles to
> >>> make this work at the expense of a larger index.
> >>>
> >>> This is a generic problem with context. “Lamin A” is really a
> >>> “concept”, not just two words that happen to be near each other.
> >>> Searching as a phrase is an OOB-but-naive way to try to make it more
> >>> likely that the ranked results refer to the _concept_ of “Lamin A”. The
> >>> assumption here is “if these two words appear next to each other,
> >>> they’re more likely to be what I want”. I say “naive” because “Lamins: A
> >>> new approach to...” would _also_ be found for a naive phrase search. (I
> >>> have no idea whether such a title makes sense or not, but you figured
> >>> that out already)...
> >>>
> >>> To do this well you’d have to dive in to NLP/Machine learning.
> >>>
> >>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
> >>>
> >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
> >> wrote:
> >>>>
> >>>> Hi Walter and Paras
> >>>>
> >>>> I reindexed after removing all references to StopWordFilter, and I went
> >>>> from 121 results to nearly 20K, as the search term q="Lymphoid and a
> >>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A".
> >>>> So I don't think removing it completely is the way to go for the
> >>>> scenario we have, but I appreciate the suggestion…
> >>>>
> >>>> Yes the response is using fl=*
> >>>> I am trying some combinations at the moment, but no success yet.
> >>>>
> >>>> defType=edismax
> >>>> q.alt=Lymphoid and a non-Lymphoid cell
> >>>> Number of results=1599
> >>>> Quite a considerable increase, though the results are reasonably
> >>>> meaningful.
> >>>>
> >>>> I am sorry, but I didn't understand exactly what you want me to do
> >>>> with the lst (??), qf and bf.
> >>>>
> >>>> Thanks, everyone, for your input
> >>>>
> >>>>
> >>>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com>
> >> wrote:
> >>>>>
> >>>>> Hi Guilherme
> >>>>>
> >>>>> By accident, I ended up querying using the default handler
> >>>>> (/select) and it worked.
> >>>>>
> >>>>> You've just found the culprit. Thanks for giving the material I
> >>>>> requested. Your analysis chain is working as expected. I don't see any
> >>>>> issue in either StopWordFilter or your boosts. I also use a boost of 50
> >>>>> when boosting contextual suggestions (boosting "gold iphone" on a page
> >>>>> of iphone) but I take Walter's suggestion and would try to optimize my
> >>>>> weights. I agree that this 50 thing was not researched much by us
> >>>>> either (we never faced performance or relevance issues).
> >>>>>
> >>>>> See the major difference between the two handlers: edismax. I'm pretty
> >>>>> sure that your problem lies in the parsing of queries (you can confirm
> >>>>> that from the parsedquery key in the debug section of both JSON
> >>>>> responses). I hope you have provided the response with fl=*. Replace q
> >>>>> with q.alt in your /search handler query and I think you should start
> >>>>> getting responses. That's because q.alt uses the standard parser. If
> >>>>> you want to keep using edismax, I suggest you test the responses after
> >>>>> removing some combination of the lst entries (qf, bf) to find what's
> >>>>> keeping the documents from coming up. I'm out of office today;
> >>>>> otherwise I would certainly have tried analyzing the field values of
> >>>>> the document in the /select request and comparing them with qf/bq in
> >>>>> the solrconfig.xml /search handler. Do this for me and you'd certainly
> >>>>> find something.
> >>>>>
> >>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <wun...@wunderwood.org
> >> <mailto:wun...@wunderwood.org>> wrote:
> >>>>> I normally use a weight of 8 for the most important field, like title.
> >>>>> Other fields might get a 4 or 2.
> >>>>>
> >>>>> I add a “pf” field with the weights doubled, so that phrase matches
> >>>>> have a higher weight.
> >>>>>
> >>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
> >>>>> early web search engines. With different relevance algorithms and
> >>>>> totally different evaluation and tuning systems, they settled on
> >>>>> weights of 8 and 7.5 for HTML titles. With the two radically different
> >>>>> systems getting the same number, I decided that was a property of the
> >>>>> documents, not of the search engines.
> >>>>>
> >>>>> wunder
> >>>>> Walter Underwood
> >>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> >> (my blog)
> >>>>>
> >>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk
> >> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>
> >>>>>> Hi Wunder,
> >>>>>>
> >>>>>> My indexer takes quite a few hours to run. I am shortening it to run
> >>>>>> faster, but I also need to make sure it gives what we are expecting.
> >>>>>> This implementation has been there for >4y and is heavily used.
> >>>>>>
> >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> >>>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> >>>>>>> years of configuring Solr.
> >>>>>> I've inherited that implementation and I am really keen to adjust
> >>>>>> it. What would you recommend?
> >>>>>>
> >>>>>> Cheers
> >>>>>> Guilherme
> >>>>>>
> >>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org
> >> <mailto:wun...@wunderwood.org>> wrote:
> >>>>>>>
> >>>>>>> Thanks for posting the files. Looking at schema.xml, I see that you
> >>>>>>> are still using StopFilterFactory. The first advice we gave you was
> >>>>>>> to remove that.
> >>>>>>>
> >>>>>>> Remove StopFilterFactory everywhere and reindex.
> >>>>>>>
> >>>>>>> You will continue to have problems matching stopwords until you do
> >>>>>>> that.
> >>>>>>>
> >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> >>>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> >>>>>>> years of configuring Solr.
> >>>>>>>
> >>>>>>> wunder
> >>>>>>> Walter Underwood
> >>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> >> (my blog)
> >>>>>>>
> >>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk
> >> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Paras, everyone
> >>>>>>>>
> >>>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
> >>>>>>>> you had trouble with the attachments. I will host them somewhere and
> >>>>>>>> share the links.
> >>>>>>>> I don't tweak my index: I get the data from the graph database,
> >>>>>>>> create the documents as they are and save them to Solr.
> >>>>>>>>
> >>>>>>>> So, I am sending the new analysis screen, queried the way you
> >>>>>>>> suggested, along with the results with params and the Solr query URL.
> >>>>>>>>
> >>>>>>>> While running the queries you asked for, I found something really
> >>>>>>>> weird (at least for me). By accident, I ended up querying using the
> >>>>>>>> default handler (/select) and it worked. If I use the handler I must
> >>>>>>>> use, it sadly doesn't work. I am posting both results and I will
> >>>>>>>> also post the handlers.
> >>>>>>>>
> >>>>>>>> Here is the link with all the files mentioned before
> >>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >>>>>>>> If the link doesn't work: www dot dropbox dot com slash sh slash
> >>>>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
> paras.leh...@indiamart.com
> >> <mailto:paras.leh...@indiamart.com>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Guilherme.
> >>>>>>>>>
> >>>>>>>>> I am sending the analysis result and the json result as requested.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low quality
> >>>>>>>>> though).
> >>>>>>>>>
> >>>>>>>>> From the analysis screen, the analysis is working as expected. One
> >>>>>>>>> of the reasons I can initially think of for query="lymphoid and *a*
> >>>>>>>>> non-lymphoid cell" not matching the document containing "Lymphoid
> >>>>>>>>> and a non-Lymphoid cell" is: the stopword "a" is probably present
> >>>>>>>>> post-analysis in either the query or the index. Did you tweak your
> >>>>>>>>> index-time analysis after indexing?
> >>>>>>>>>
> >>>>>>>>> Do two things:
> >>>>>>>>>
> >>>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
> >>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
> >>>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image
> >>>>>>>>> and providing the link here.
> >>>>>>>>> 2. Give the same JSON output as you have sent but this time with
> >>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query URL.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
> >> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some
> >>>>>>>>>> such. The Apache server is fairly aggressive about stripping
> >>>>>>>>>> attachments though, so it’s also possible they didn’t make it
> >>>>>>>>>> through.
> >>>>>>>>>>
> >>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
> gvit...@ebi.ac.uk
> >> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks Erick.
> >>>>>>>>>>>
> >>>>>>>>>>>> First, your index and analysis chains are considerably
> >>>>>>>>>>>> different, this can easily be a source of problems. In
> >>>>>>>>>>>> particular, using two different tokenizers is a huge red flag. I
> >>>>>>>>>>>> _strongly_ recommend against this unless you’re totally sure you
> >>>>>>>>>>>> understand the consequences. Additionally, your use of the length
> >>>>>>>>>>>> filter is suspicious, especially since your problem statement is
> >>>>>>>>>>>> about the addition of a single letter term and the min length
> >>>>>>>>>>>> allowed on that filter is 2. That said, it’s reasonable to suppose
> >>>>>>>>>>>> that the ’a’ is filtered out in both cases, but maybe you’ve found
> >>>>>>>>>>>> something odd about the interactions.
> >>>>>>>>>>> I will investigate the min length and post the results later.
> >>>>>>>>>>>
> >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> >>>>>>>>>>>> typos? Used by custom code?
> >>>>>>>>>>> This is the URL in my application, not Solr params. That's the
> >>>>>>>>>>> query string.
> >>>>>>>>>>>
> >>>>>>>>>>>> What does “species=” do? That’s not Solr syntax, so it’s likely
> >>>>>>>>>>>> that all the params with an equal-sign are totally ignored unless
> >>>>>>>>>>>> it’s just a typo.
> >>>>>>>>>>> This is part of the application. Species will be used later on in
> >>>>>>>>>>> Solr to filter the results. That's not Solr; those are my app
> >>>>>>>>>>> params.
> >>>>>>>>>>>
> >>>>>>>>>>>> Third, the easiest way to see what’s happening under the covers
> >>>>>>>>>>>> is to add “&debug=true” to the query and look at the parsed query.
> >>>>>>>>>>>> Ignore all the relevance calculations for the nonce, or specify
> >>>>>>>>>>>> “&debug=query” to skip that part.
> >>>>>>>>>>> The two JSON files I've sent were generated with debugQuery=on,
> >>>>>>>>>>> and the explain tag is present.
> >>>>>>>>>>> I will try searching the way you mentioned.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for your input
> >>>>>>>>>>>
> >>>>>>>>>>> Guilherme
> >>>>>>>>>>>
> >>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
> >> erickerick...@gmail.com <mailto:erickerick...@gmail.com>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> First, your index and analysis chains are considerably
> >>>>>>>>>>>> different, this can easily be a source of problems. In
> >>>>>>>>>>>> particular, using two different tokenizers is a huge red flag. I
> >>>>>>>>>>>> _strongly_ recommend against this unless you’re totally sure you
> >>>>>>>>>>>> understand the consequences. Additionally, your use of the length
> >>>>>>>>>>>> filter is suspicious, especially since your problem statement is
> >>>>>>>>>>>> about the addition of a single letter term and the min length
> >>>>>>>>>>>> allowed on that filter is 2. That said, it’s reasonable to suppose
> >>>>>>>>>>>> that the ’a’ is filtered out in both cases, but maybe you’ve found
> >>>>>>>>>>>> something odd about the interactions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> >>>>>>>>>>>> typos? Used by custom code?
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>
> >>>>>>>>>>>> What does “species=” do? That’s not Solr syntax, so it’s likely
> >>>>>>>>>>>> that all the params with an equal-sign are totally ignored unless
> >>>>>>>>>>>> it’s just a typo.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Third, the easiest way to see what’s happening under the covers
> >>>>>>>>>>>> is to add “&debug=true” to the query and look at the parsed query.
> >>>>>>>>>>>> Ignore all the relevance calculations for the nonce, or specify
> >>>>>>>>>>>> “&debug=query” to skip that part.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 90%+ of the time, the question “why didn’t this query do what I
> >>>>>>>>>>>> expect” is answered by looking at the “&debug=query” output and the
> >>>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure
> >>>>>>>>>>>> to look at _both_ the query and index output. Also, and very
> >>>>>>>>>>>> important about the analysis page (and this is confusing) is that
> >>>>>>>>>>>> this _assumes_ that what you put in the text boxes has made it
> >>>>>>>>>>>> through the query parser intact and is analyzed by the field
> >>>>>>>>>>>> selected. Consider the search "q=field:word1 word2". Now you type
> >>>>>>>>>>>> “word1 word2” into the analysis text box and it looks like what you
> >>>>>>>>>>>> expect. That’s misleading because the query is _parsed_ as
> >>>>>>>>>>>> "field:word1 default_search_field:word2". This is where
> >>>>>>>>>>>> “&debug=query” helps.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Erick
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> >> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Walter,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those
> >>>>>>>>>>>>>> words will not be in the index, so they can never match a query.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think the OP's concern is different results when adding a
> >>>>>>>>>>>>> stopword. I think he's using the filter factory correctly - the
> >>>>>>>>>>>>> query chain includes the filter as well so it should remove "a"
> >>>>>>>>>>>>> while querying.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *@Guilherme*, please post the results for both queries, the
> >>>>>>>>>>>>> document in the results you are concerned about, and the full
> >>>>>>>>>>>>> output of the analysis screen (for both query and index).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
> >> wun...@wunderwood.org <mailto:wun...@wunderwood.org>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> No.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those
> >>>>>>>>>>>>>> words will not be in the index, so they can never match a query.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis
> >>>>>>>>>>>>>> chain in schema.xml.
> >>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the
> >>>>>>>>>>>>>> new config.
> >>>>>>>>>>>>>> 3. Reindex all of the documents.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will not
> >>>>>>>>>>>>>> be removed and they will be searchable.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> wunder
> >>>>>>>>>>>>>> Walter Underwood
> >>>>>>>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >>>>>>>>>>>>>> http://observer.wunderwood.org/ <
> >> http://observer.wunderwood.org/>  (my blog)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
> >> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Ok. I am kind of lost now.
> >>>>>>>>>>>>>>> If I open up the console > analysis and perform it, that's the
> >>>>>>>>>>>>>>> final result.
> >>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
> >>>>>>>>>>>>>>> schema.xml and, during the index phase,
> >>>>>>>>>>>>>>> replaceAll("in stopwords.txt"," ") then add to Solr. Is that
> >>>>>>>>>>>>>>> correct?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks David
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> >>>>>>>>>> hastings.recurs...@gmail.com <mailto:
> hastings.recurs...@gmail.com
> >>>
> >>>>>>>>>>>>>> <mailto:hastings.recurs...@gmail.com <mailto:
> >> hastings.recurs...@gmail.com>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> no,
> >>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> is still using stopwords and should be removed. That is my
> >>>>>>>>>>>>>>>> opinion, of course, and your use case may be different, but I
> >>>>>>>>>>>>>>>> generally axe any reference to them at all
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
> >> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
> >>>>>>>>>>>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>>>> Haven't I done this here ?
> >>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>>>>>>>>>>> <analyzer type="index">
> >>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> >>>>>>>>>> hastings.recurs...@gmail.com <mailto:
> hastings.recurs...@gmail.com
> >>>
> >>>>>>>>>>>>>> <mailto:hastings.recurs...@gmail.com <mailto:
> >> hastings.recurs...@gmail.com>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to stop
> >>>>>>>>>>>>>>>>>> words and never use them, then re-index your data and try it
> >>>>>>>>>>>>>>>>>> again.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> >>>>>>>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
> >>>>>>>>>>>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field), however
> >>>>>>>>>>>>>>>>>>> this term contains 'and' and 'a' and it doesn't return any
> >>>>>>>>>>>>>>>>>>> records. If I remove 'a' then it works.
> >>>>>>>>>>>>>>>>>>> e.g
> >>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>>>>>>>>>>>>>>>> doesn't work:
> >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
> >>>>>>>>>>>>>>>>>>> works:
> >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>>>>>>>> interested in the first result
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> schema.xml
> >>>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true"
> >>>>>>>>>>>>>>>>>>>        omitNorms="false" required="true" multiValued="false"/>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>>>>>>>>>>>>>>>>            positionIncrementGap="100" omitNorms="false">
> >>>>>>>>>>>>>>>>>>> <analyzer type="index">
> >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>>> </fieldType>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> stopwords.txt
> >>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
> >>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>>>>> c
> >>>>>>>>>>>>>>>>>>> ....
> >>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>> Guilherme
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Paras Lehana* [65871]
> >>>>>>>>>>>>> Development Engineer, Auto-Suggest,
> >>>>>>>>>>>>> IndiaMART Intermesh Ltd.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> >>>>>>>>>>>>> Noida, UP, IN - 201303
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Mob.: +91-9560911996
> >>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> IMPORTANT:
> >>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
>
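
The edismax suggestions made in this thread (qf, pf with doubled weights,
mm=100%, debug=query) can be combined in a single request. Below is a minimal
Python sketch that only builds the request URL and does not contact Solr. The
host, the core name "example_core" and the function name are hypothetical,
and the field names title, keywords and text are taken from the example
weights quoted above, not from the schema under discussion.

```python
from urllib.parse import urlencode, parse_qs


def edismax_url(base_url, query):
    """Build a Solr select URL using the edismax settings discussed above."""
    params = [
        ("defType", "edismax"),
        ("q", query),
        ("qf", "title^8 keywords^4 text"),     # per-field weights for term matches
        ("pf", "title^16 keywords^8 text^2"),  # doubled weights for phrase matches
        ("mm", "100%"),                        # require all query terms to match
        ("debug", "query"),                    # return the parsed query for inspection
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name, for illustration only.
url = edismax_url(
    "http://localhost:8983/solr/example_core/select",
    "lymphoid and a non-lymphoid cell",
)
print(url)
```

The parsed query in the debug section of the response is the first place to
look when a request like this returns fewer documents than expected.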
