Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood Thu, 07 Nov 2019 07:32:00 -0800

I normally use a weight of 8 for the most important field, like title. Other 
fields might get a 4 or 2.


I add a “pf” (phrase fields) parameter with the weights doubled, so that phrase
matches have a higher weight.

The weight of 8 comes from experience at Infoseek and Inktomi, two early web 
search engines. With different relevance algorithms and totally different 
evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML 
titles. With two radically different systems arriving at nearly the same number,
I decided that was a property of the documents, not of the search engines.
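
As a rough illustration of the weighting described above, an edismax handler in
solrconfig.xml might look like the sketch below. The handler name "/search" and
the fields "summary" and "description" are made up for the example (only "name"
appears in the posted schema), so treat it as a sketch under those assumptions
rather than a recommended configuration.

```xml
<!-- Hypothetical handler: the name and the non-"name" fields are assumptions -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: per-field weights, 8 for the most important field, then 4 and 2 -->
    <str name="qf">name^8 summary^4 description^2</str>
    <!-- pf: the same fields with weights doubled, so phrase matches score higher -->
    <str name="pf">name^16 summary^8 description^4</str>
  </lst>
</requestHandler>
```

The absolute numbers matter less than the ratios between fields; keeping the
top weight near 8 simply leaves room to adjust the others.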

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> 
> Hi Wunder,
> 
> My indexer takes quite a few hours to run. I am shortening it to run
> faster, but I also need to make sure it gives what we are expecting. This
> implementation has been in place for more than 4 years and is heavily used.
> 
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>> configuring Solr.
> I've inherited that implementation and I am really keen to adjust it. What
> would you recommend?
> 
> Cheers
> Guilherme
> 
>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org> wrote:
>> 
>> Thanks for posting the files. Looking at schema.xml, I see that you still 
>> are using StopFilterFactory. The first advice we gave you was to remove that.
>> 
>> Remove StopFilterFactory everywhere and reindex.
>> 
>> You will continue to have problems matching stopwords until you do that.
>> 
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>> configuring Solr.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>> 
>>> Hi Paras, everyone
>>> 
>>> Thank you again for your input and suggestions. I am sorry to hear you had
>>> trouble with the attachments; I will host them somewhere and share the links.
>>> I don't tweak my index. I get the data from the graph database, create the
>>> documents as they are, and save them to Solr.
>>> 
>>> So, I am sending the new analysis screen, queried the way you suggested,
>>> along with the results with params and the Solr query URL.
>>> 
>>> While running the queries you asked for, I found something really weird (at
>>> least for me). By accident, I ended up querying with the default handler
>>> (/select) and it worked. But if I use the handler I must use, it sadly
>>> doesn't work. I am posting both results, and I will post the handlers as
>>> well.
>>> 
>>> Here is the link with all the files mentioned before
>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
>>> If the link doesn't work www dot dropbox dot com slash sh slash 
>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>> 
>>> Thanks
>>> 
>>>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>> 
>>>> Hi Guilherme.
>>>> 
>>>> I am sending the analysis result and the JSON result as requested.
>>>> 
>>>> 
>>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>>> though).
>>>> 
>>>> From the analysis screen, the analysis is working as expected. One of the
>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>>> think of is: the stopword "a" is probably present in post-analysis either
>>>> of query or index. Did you tweak your index time analysis after indexing?
>>>> 
>>>> Do two things:
>>>> 
>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>> providing the link here.
>>>> 2. Give the same JSON output as you have sent but this time with
>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>> 
>>>> 
>>>> 
>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> 
>>>> wrote:
>>>> 
>>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>>>> Apache server is fairly aggressive about stripping attachments though, so
>>>>> it’s also possible they didn’t make it through.
>>>>> 
>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>> 
>>>>>> Thanks Erick.
>>>>>> 
>>>>>>> First, your index and analysis chains are considerably different, this
>>>>>>> can easily be a source of problems. In particular, using two different
>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>> you’re totally sure you understand the consequences. Additionally, your
>>>>>>> use of the length filter is suspicious, especially since your problem
>>>>>>> statement is about the addition of a single letter term and the min
>>>>>>> length allowed on that filter is 2. That said, it’s reasonable to suppose
>>>>>>> that the ’a’ is filtered out in both cases, but maybe you’ve found
>>>>>>> something odd about the interactions.
>>>>>> I will investigate the min length and post the results later.
>>>>>> 
>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>> Used by custom code?
>>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>>> 
>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>> typo.
>>>>>> This is part of the application. Species will be used later on in Solr
>>>>>> to filter the results. That's not Solr; those are my app's params.
>>>>>> 
>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all
>>>>>>> the relevance calculations for the nonce, or specify “&debug=query” to
>>>>>>> skip that part.
>>>>>> The two JSON files I've sent were produced with debugQuery=on, and the
>>>>>> explain tag is present.
>>>>>> I will try searching the way you mentioned.
>>>>>> 
>>>>>> Thanks for your input.
>>>>>> 
>>>>>> Guilherme
>>>>>> 
>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>> Fwd to another server
>>>>>>> 
>>>>>>> First, your index and analysis chains are considerably different, this
>>>>>>> can easily be a source of problems. In particular, using two different
>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>> you’re totally sure you understand the consequences. Additionally, your
>>>>>>> use of the length filter is suspicious, especially since your problem
>>>>>>> statement is about the addition of a single letter term and the min
>>>>>>> length allowed on that filter is 2. That said, it’s reasonable to suppose
>>>>>>> that the ’a’ is filtered out in both cases, but maybe you’ve found
>>>>>>> something odd about the interactions.
>>>>>>> 
>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>> Used by custom code?
>>>>>>> 
>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>> 
>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>> typo.
>>>>>>> 
>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all
>>>>>>> the relevance calculations for the nonce, or specify “&debug=query” to
>>>>>>> skip that part.
>>>>>>> 
>>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>>>>> expect” is answered by looking at the “&debug=query” output and the
>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>>>>>> at _both_ the query and index output. Also, and very important about the
>>>>>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>>>>>> put in the text boxes have made it through the query parser intact and is
>>>>>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>>>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>>>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>>>>>> helps.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Erick
>>>>>>> 
>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi Walter,
>>>>>>>> 
>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>>>>> think he's using the filter factory correctly - the query chain
>>>>>>>> includes the filter as well so it should remove "a" while querying.
>>>>>>>> 
>>>>>>>> *@Guilherme*, please post the results for both queries and the document
>>>>>>>> in the result you are concerned about, and post the full analysis screen
>>>>>>>> output (for both query and index).
>>>>>>>> 
>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> No.
>>>>>>>>> 
>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>> 
>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>>> schema.xml.
>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>>>>>>> config.
>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>> 
>>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>>> removed and they will be searchable.
>>>>>>>>> 
>>>>>>>>> wunder
>>>>>>>>> Walter Underwood
>>>>>>>>> wun...@wunderwood.org
>>>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>>>> 
>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Ok. I am kind of lost now.
>>>>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>>>>> result.
>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>> 
>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>>>>>>>>>> then add to solr. Is that correct?
>>>>>>>>>> 
>>>>>>>>>> Thanks David
>>>>>>>>>> 
>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>> hastings.recurs...@gmail.com
>>>>>>>>> <mailto:hastings.recurs...@gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Fwd to another server
>>>>>>>>>>> 
>>>>>>>>>>> no,
>>>>>>>>>>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>> 
>>>>>>>>>>> is still using stopwords and should be removed. That is my opinion, of
>>>>>>>>>>> course, and your use case may be different, but I generally axe any
>>>>>>>>>>> reference to them at all.
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk
>>>>>>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>    <analyzer type="index">
>>>>>>>>>>>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>        <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>        <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>        <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>    </analyzer>
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>> hastings.recurs...@gmail.com
>>>>>>>>> <mailto:hastings.recurs...@gmail.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The first thing you should do is remove any reference to stop words
>>>>>>>>>>>>> and never use them, then re-index your data and try it again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>> gvit...@ebi.ac.uk
>>>>>>>>> <mailto:gvit...@ebi.ac.uk>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am performing a search to match a name (text_field), however this
>>>>>>>>>>>>>> term contains 'and' and 'a' and it doesn't return any records. If I
>>>>>>>>>>>>>> remove 'a' then it works.
>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true"
>>>>>>>>>>>>>> omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>    <analyzer type="query">
>>>>>>>>>>>>>>        <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>        <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>        <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>        <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>        <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>        <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>    </analyzer>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>    <analyzer type="index">
>>>>>>>>>>>>>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>        <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>        <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>        <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>    </analyzer>
>>>>>>>>>>>>>>    <analyzer type="query">
>>>>>>>>>>>>>>        <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>        <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>        <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>        <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>        <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>        <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>    </analyzer>
>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>> b
>>>>>>>>>>>>>> c
>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Running Solr 6.6.2.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>> 
>>>>>>>> *Paras Lehana* [65871]
>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>> 
>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>> Noida, UP, IN - 201303
>>>>>>>> 
>>>>>>>> Mob.: +91-9560911996
>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>> 
>>>>>>>> --
>>>>>>>> IMPORTANT:
>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> -- 
>>>> -- 
>>>> Regards,
>>>> 
>>>> *Paras Lehana* [65871]
>>>> Development Engineer, Auto-Suggest,
>>>> IndiaMART Intermesh Ltd.
>>>> 
>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>> Noida, UP, IN - 201303
>>>> 
>>>> Mob.: +91-9560911996
>>>> Work: 01203916600 | Extn:  *8173*
>>>> 
>>>> -- 
>>>> IMPORTANT: 
>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>> 
>
