Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood Fri, 08 Nov 2019 09:15:04 -0800

If we had IDF for phrases, phrase boosts would be super effective. The 2X weight
on pf is a hack that mostly works.
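
For what it's worth, the 2X hack is easy to script. The following is a minimal Python sketch, not anything from Solr itself: it doubles each qf boost to produce a pf string and builds an edismax request with debug=query so the parsed query can be inspected. The helper name double_weights, the host, and the core name "mycore" are hypothetical.

```python
from urllib.parse import urlencode


def double_weights(qf):
    """Build a pf string whose boosts are twice those in qf.

    Fields without an explicit boost are treated as having a boost of 1.
    """
    parts = []
    for item in qf.split():
        field, _, boost = item.partition("^")
        weight = float(boost) if boost else 1.0
        parts.append(f"{field}^{weight * 2:g}")
    return " ".join(parts)


QF = "title^8 keywords^4 text"
PF = double_weights(QF)

params = {
    "defType": "edismax",
    "q": "lymphoid and a non-lymphoid cell",
    "qf": QF,
    "pf": PF,
    "debug": "query",  # return the parsed query without the scoring explain
}

# Hypothetical host and core name; adjust to the real installation.
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
print(PF)
print(url)
```

Comparing the parsedquery section of the response with and without pf shows what the phrase boost adds.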


Infoseek had phrase IDF and it was a killer algorithm for relevance.
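
To make the idea concrete, here is a minimal Python sketch of what a phrase IDF could look like on a toy corpus. The documents, the whitespace tokenizer, and the BM25-style smoothing are assumptions made for illustration; this is not Infoseek's algorithm and not something Solr exposes.

```python
import math

# Toy corpus, purely hypothetical.
DOCS = [
    "lamin a mutations cause disease",
    "a new approach to lamins",
    "ift a complex in a cilium",
    "a lymphoid and a non-lymphoid cell",
]


def tokens(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def doc_freq_term(term, docs):
    """Number of documents containing the single term."""
    return sum(1 for d in docs if term in tokens(d))


def doc_freq_phrase(phrase, docs):
    """Number of documents containing the words of the phrase adjacently, in order."""
    words = tokens(phrase)
    n = len(words)
    count = 0
    for d in docs:
        t = tokens(d)
        if any(t[i:i + n] == words for i in range(len(t) - n + 1)):
            count += 1
    return count


def idf(df, num_docs):
    """BM25-style smoothed inverse document frequency (an assumed formula)."""
    return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))


n_docs = len(DOCS)
print("idf('a')       =", idf(doc_freq_term("a", DOCS), n_docs))
print("idf('lamin')   =", idf(doc_freq_term("lamin", DOCS), n_docs))
print("idf('lamin a') =", idf(doc_freq_phrase("lamin a", DOCS), n_docs))
```

In this toy corpus "a" appears in every document, so its IDF is close to zero, while the phrase "lamin a" appears in only one document and is weighted like a rare term. A fixed 2X on pf cannot make that distinction.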

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 8, 2019, at 11:08 AM, David Hastings <hastings.recurs...@gmail.com> 
> wrote:
> 
> the pf and qf fields are REALLY nice for this
> 
> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>> I always enable phrase searching in edismax for exactly this reason.
>> 
>> Something like:
>> 
>>       <str name="qf">title^8 keywords^4 text</str>
>>       <str name="pf">title^16 keywords^8 text^2</str>
>> 
>> To deal with concepts in queries, a classifier and/or named entity
>> extractor can be helpful. If you have a list of concepts (“controlled
>> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
>> term can be queried against the field matching that vocabulary.
>> 
>> This is how LinkedIn separates people, companies, and places, for example.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>> 
>>> Look at the “mm” parameter, try setting it to 100%. Although that’s not
>> entirely likely to do what you want either since virtually every doc will
>> have “a” in it. But at least you’d get docs that have both terms.
>>> 
>>> You may also be able to search for things like “Lamin A” _only as a
>> phrase_ and have some luck. But this is a gnarly problem in general. Some
>> people have been able to substitute synonyms and/or shingles to make this
>> work at the expense of a larger index.
>>> 
>>> This is a generic problem with context. “Lamin A” is really a “concept”,
>> not just two words that happen to be near each other. Searching as a phrase
>> is an OOB-but-naive way to try to make it more likely that the ranked
>> results refer to the _concept_ of “Lamin A”. The assumption here is “if
>> these two words appear next to each other, they’re more likely to be what I
>> want”. I say “naive” because “Lamins: A new approach to...” would _also_ be
>> found for a naive phrase search. (I have no idea whether such a title makes
>> sense or not, but you figured that out already)...
>>> 
>>> To do this well you’d have to dive into NLP/machine learning.
>>> 
>>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
>>> 
>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>> wrote:
>>>> 
>>>> Hi Walter and Paras
>>>> 
>>>> I indexed it removing all the references to StopWordFilter and I went
>> from 121 results to near 20K as the search term q="Lymphoid and a
>> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A". So I
>> don't think removing it completely is the way to go from the scenario we
>> have, but I appreciate the suggestion…
>>>> 
>>>> Yes the response is using fl=*
>>>> I am trying some combinations at the moment, but no success yet.
>>>> 
>>>> defType=edismax
>>>> q.alt=Lymphoid and a non-Lymphoid cell
>>>> Number of results=1599
>>>> Quite a considerable increase, though with reasonably meaningful results.
>>>> 
>>>> I am sorry, but I didn't understand what exactly you want me to do
>> with the lst (??), qf and bf.
>>>> 
>>>> Thanks, everyone, for your input.
>>>> 
>>>> 
>>>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com>
>> wrote:
>>>>> 
>>>>> Hi Guilherme
>>>>> 
>>>>> By accident, I ended up querying using the default handler
>> (/select) and it worked.
>>>>> 
>>>>> You've just found the culprit. Thanks for giving the material I
>> requested. Your analysis chain is working as expected. I don't see any
>> issue in either StopWordFilter or your boosts. I also use a boost of 50
>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>> iphone) but I take Walter's suggestion and will try to optimize my
>> weights. I agree that we never researched this value of 50 much either
>> (we never faced performance or relevance issues).
>>>>> 
>>>>> The major difference between the two handlers is edismax. I'm pretty
>> sure that your problem lies in the parsing of queries (you can confirm that
>> from parsedquery key in debug of both JSON responses). I hope you have
>> provided the response with fl=*. Replace q with q.alt in your /search
>> handler query and I think you should start getting responses. That's
>> because q.alt uses the standard parser. If you want to keep using edismax, I
>> suggest you test the responses after removing some combination of lst (qf, bf)
>> and find what's keeping the documents from coming up. I'm out of office
>> today - would have certainly tried analyzing the field values of the
>> document in /select request and compare it with qf/bq in solrconfig.xml
>> /search. Do this for me and you'd certainly find something.
>>>>> 
>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <wun...@wunderwood.org
>> <mailto:wun...@wunderwood.org>> wrote:
>>>>> I normally use a weight of 8 for the most important field, like title.
>> Other fields might get a 4 or 2.
>>>>> 
>>>>> I add a “pf” field with the weights doubled, so that phrase matches
>> have a higher weight.
>>>>> 
>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>> early web search engines. With different relevance algorithms and totally
>> different evaluation and tuning systems, they settled on weights of 8 and
>> 7.5 for HTML titles. With the two radically different systems getting
>> the same number, I decided that was a property of the documents, not of the
>> search engines.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>> (my blog)
>>>>> 
>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>> 
>>>>>> Hi Wunder,
>>>>>> 
>>>>>> My indexer takes quite a few hours to run. I am shortening it
>> to run faster, but I also need to make sure it gives what we are expecting.
>> This implementation has been in place for >4y and is heavily used.
>>>>>> 
>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>>>> I've inherited that implementation and I am really keen to adjust
>> it. What would you recommend?
>>>>>> 
>>>>>> Cheers
>>>>>> Guilherme
>>>>>> 
>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org
>> <mailto:wun...@wunderwood.org>> wrote:
>>>>>>> 
>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that you
>> still are using StopFilterFactory. The first advice we gave you was to
>> remove that.
>>>>>>> 
>>>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>>> 
>>>>>>> You will continue to have problems matching stopwords until you do
>> that.
>>>>>>> 
>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>> (my blog)
>>>>>>> 
>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>> 
>>>>>>>> Hi Paras, everyone
>>>>>>>> 
>>>>>>>> Thank you again for your input and suggestions. I am sorry to hear
>> you had trouble with the attachments. I will host them somewhere and share the
>> links.
>>>>>>>> I don't tweak my index. I get the data from the graph database,
>> create the documents as they are and save them to Solr.
>>>>>>>> 
>>>>>>>> So, I am sending the new analysis screen querying the way you
>> suggested. Also the results with params and solr query url.
>>>>>>>> 
>>>>>>>> While running the queries you asked for, I found something
>> really weird (at least to me). By accident, I ended up querying using
>> the default handler (/select) and it worked. But if I use the one I must
>> use, it sadly doesn't work. I am posting both results and I will also
>> post the handlers.
>>>>>>>> 
>>>>>>>> Here is the link with all the files mentioned before
>>>>>>>> 
>>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com
>> <mailto:paras.leh...@indiamart.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Guilherme.
>>>>>>>>> 
>>>>>>>>> I am sending the analysis result and the json result as requested.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>> quality
>>>>>>>>> though).
>>>>>>>>> 
>>>>>>>>> From the analysis screen, the analysis is working as expected. One
>> of the
>>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
>> initially
>>>>>>>>> think of is: the stopword "a" is probably present in post-analysis
>> either
>>>>>>>>> of query or index. Did you tweak your index time analysis after
>> indexing?
>>>>>>>>> 
>>>>>>>>> Do two things:
>>>>>>>>> 
>>>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image
>>>>>>>>> and providing the link here.
>>>>>>>>> 2. Give the same JSON output as you have sent but this time with
>>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some
>> such. The
>>>>>>>>>> Apache server is fairly aggressive about stripping attachments
>> though, so
>>>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>>> 
>>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thanks Erick.
>>>>>>>>>>> 
>>>>>>>>>>>> First, your index and analysis chains are considerably
>> different, this
>>>>>>>>>> can easily be a source of problems. In particular, using two
>> different
>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>> this unless
>>>>>>>>>> you’re totally sure you understand the consequences.
>> Additionally, your use
>>>>>>>>>> of the length filter is suspicious, especially since your problem
>> statement
>>>>>>>>>> is about the addition of a single letter term and the min length
>> allowed on
>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
>> ’a’ is
>>>>>>>>>> filtered out in both cases, but maybe you’ve found something odd
>> about the
>>>>>>>>>> interactions.
>>>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>>> 
>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
>> typos?
>>>>>>>>>> Used by custom code?
>>>>>>>>>>> This is the URL in my application, not Solr params. That's the
>> query string.
>>>>>>>>>>> 
>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
>> that
>>>>>>>>>> all the params with an equal-sign are totally ignored unless it’s
>> just a
>>>>>>>>>> typo.
>>>>>>>>>>> This is part of the application. Species will be used later on
>>>>>>>>>>> in Solr to filter the results. That's not Solr; those are my app params.
>>>>>>>>>>> 
>>>>>>>>>>>> Third, the easiest way to see what’s happening under the covers
>> is to
>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>> Ignore all the
>>>>>>>>>> relevance calculations for the nonce, or specify “&debug=query”
>> to skip
>>>>>>>>>> that part.
>>>>>>>>>>> The two JSON files I've sent were produced with debugQuery=on, and the
>>>>>>>>>>> explain tag is present.
>>>>>>>>>>> I will try searching the way you mentioned.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for your input
>>>>>>>>>>> 
>>>>>>>>>>> Guilherme
>>>>>>>>>>> 
>>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>> 
>>>>>>>>>>>> First, your index and analysis chains are considerably
>> different, this
>>>>>>>>>> can easily be a source of problems. In particular, using two
>> different
>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>> this unless
>>>>>>>>>> you’re totally sure you understand the consequences.
>> Additionally, your use
>>>>>>>>>> of the length filter is suspicious, especially since your problem
>> statement
>>>>>>>>>> is about the addition of a single letter term and the min length
>> allowed on
>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
>> ’a’ is
>>>>>>>>>> filtered out in both cases, but maybe you’ve found something odd
>> about the
>>>>>>>>>> interactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
>> typos?
>>>>>>>>>> Used by custom code?
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>> 
>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
>> that
>>>>>>>>>> all the params with an equal-sign are totally ignored unless it’s
>> just a
>>>>>>>>>> typo.
>>>>>>>>>>>> 
>>>>>>>>>>>> Third, the easiest way to see what’s happening under the covers
>> is to
>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>> Ignore all the
>>>>>>>>>> relevance calculations for the nonce, or specify “&debug=query”
>> to skip
>>>>>>>>>> that part.
>>>>>>>>>>>> 
>>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>>>>>>>> expect” is answered by looking at the “&debug=query” output and
>> the
>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
>> sure to look
>>>>>>>>>> at _both_ the query and index output. Also, and very important
>> about the
>>>>>>>>>> analysis page (and this is confusing) is that this _assumes_ that
>> what you
>>>>>>>>>> put in the text boxes have made it through the query parser
>> intact and is
>>>>>>>>>> analyzed by the field selected. Consider the search
>> "q=field:word1 word2".
>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
>> looks like
>>>>>>>>>> what you expect. That’s misleading because the query is _parsed_
>> as
>>>>>>>>>> "field:word1 default_search_field:word2”. This is where
>> “&debug=query”
>>>>>>>>>> helps.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Erick
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>> Those words
>>>>>>>>>> will
>>>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think the OP's concern is different results when adding a
>> stopword. I
>>>>>>>>>>>>> think he's using the filter factory correctly - the query chain
>>>>>>>>>> includes
>>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
>> document in
>>>>>>>>>>>>> result you are concerned about and post full result of
>> analysis screen
>>>>>>>>>> (for
>>>>>>>>>>>>> both query and index).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>> Those words
>>>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilterFactory from every analysis
>>>>>>>>>>>>>>    chain in schema.xml.
>>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever it takes to read
>>>>>>>>>>>>>>    the new config.
>>>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will
>> not be
>>>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wunder
>>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>>>>>>>>>>> http://observer.wunderwood.org/ <
>> http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> OK, I am kind of lost now.
>>>>>>>>>>>>>>> If I open up the console > analysis and perform it, that's
>> the final
>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in
>> the
>>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
>> stopwords.txt"," ")
>>>>>>>>>> then
>>>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> is still using stopwords and should be removed. That's my opinion, of
>>>>>>>>>>>>>>>> course, and your use case may be different, but I generally axe any
>>>>>>>>>>>>>>>> reference to them at all.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to
>> stop
>>>>>>>>>> words
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field),
>> however
>>>>>>>>>> this
>>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
>> records. If i
>>>>>>>>>> remove
>>>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
>> StopAnalyzer
>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> 
>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>> 
>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>> 
>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> IMPORTANT:
>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> --
>>>>> Regards,
>>>>> 
>>>>> Paras Lehana [65871]
>>>>> Development Engineer, Auto-Suggest,
>>>>> IndiaMART Intermesh Ltd.
>>>>> 
>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>> Noida, UP, IN - 201303
>>>>> 
>>>>> Mob.: +91-9560911996 <tel:+91-9560911996>
>>>>> Work: 01203916600 | Extn:  8173
>>>>> 
>>>>> IMPORTANT:
>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>> 
>>
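
One point from the thread that is easy to miss: the posted chains contain both LengthFilterFactory (min=2) and StopFilterFactory, so, as Erick noted, the single-letter token "a" may be dropped by the length filter even after the stop filter is removed. Here is a rough Python sketch of that interaction. It only mimics the order of the filters in the posted query chain (split on the PatternTokenizer pattern, length filter, lowercase, stop filter) and leaves out the PatternReplace filters; the helper name analyze and the four-word stopword set are assumptions, and real behaviour should be checked on the analysis page in the admin UI.

```python
import re

# Assumed subset of stopwords.txt; the full file is not shown in the thread.
STOPWORDS = {"a", "an", "and", "are"}


def analyze(text, use_stop_filter=True, min_len=2, max_len=20):
    """Approximate the posted query chain on a single string."""
    # Split on anything outside the character class used by the posted tokenizer.
    raw = re.split(r"[^a-zA-Z0-9/._:]+", text)
    out = []
    for tok in raw:
        if not tok:
            continue  # skip empty strings produced by the split
        if not (min_len <= len(tok) <= max_len):
            continue  # length filter (min=2, max=20 in the posted schema)
        tok = tok.lower()  # lowercase filter
        if use_stop_filter and tok in STOPWORDS:
            continue  # stop filter
        out.append(tok)
    return out


query = "Lymphoid and a non-Lymphoid cell"
print(analyze(query))
print(analyze(query, use_stop_filter=False))
print(analyze(query, use_stop_filter=False, min_len=1))
```

In this approximation, dropping only the stop filter brings "and" back but not "a"; the length filter has to be relaxed as well before "a" survives analysis.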

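Erick mentions shingles as one way to treat "Lamin A" as a unit at the expense of a larger index. A rough Python sketch of the idea follows, assuming whitespace tokenization and an underscore separator. Both are illustrative choices and the function names are hypothetical; this is not a description of how Solr's shingle filter is configured.

```python
def shingles(tokens, size=2, sep="_"):
    """Join each run of `size` adjacent tokens into a single term."""
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def index_terms(text):
    """Single tokens plus two-word shingles, as one field might store them."""
    toks = text.lower().split()
    return toks + shingles(toks)


terms = index_terms("Lamin A mutations")
print(terms)
```

A query for the single term lamin_a then matches only documents where the two words were adjacent, and the index carries roughly one extra term per token position.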