Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri Thu, 07 Nov 2019 07:04:19 -0800

Hi Wunder,

My indexer takes quite a few hours to run. I am shortening it to run 
faster, but I also need to make sure it gives what we are expecting. This 
implementation has been in place for more than 4 years and is heavily used.


> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
> configuring Solr.
I inherited that implementation and I am really keen to adjust it. What would 
you recommend?

Cheers
Guilherme

> On 7 Nov 2019, at 14:43, Walter Underwood <[email protected]> wrote:
> 
> Thanks for posting the files. Looking at schema.xml, I see that you still are 
> using StopFilterFactory. The first advice we gave you was to remove that.
> 
> Remove StopFilterFactory everywhere and reindex.
> 
> You will continue to have problems matching stopwords until you do that.
> 
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
> configuring Solr.
> 
> wunder
> Walter Underwood
> [email protected]
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[email protected]> wrote:
>> 
>> Hi Paras, everyone
>> 
>> Thank you again for your inputs and suggestions. I am sorry to hear you had 
>> trouble with the attachments. I will host them somewhere and share the links. 
>> I don't tweak my index; I get the data from the graph database, create the 
>> documents as they are, and save them to Solr.
>> 
>> So, I am sending the new analysis screen querying the way you suggested. 
>> Also the results with params and solr query url.
>> 
>> While running the queries you asked for, I found something really weird 
>> (at least to me). By accident, I ended up querying with the default 
>> handler (/select) and it worked. If I use the handler I must use, it 
>> sadly doesn't work. I am posting both results and I will also post the 
>> handlers.
>> 
>> Here is the link with all the files mentioned before
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
>> If the link doesn't work www dot dropbox dot com slash sh slash 
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>> 
>> Thanks
>> 
>>> On 7 Nov 2019, at 05:23, Paras Lehana <[email protected]> wrote:
>>> 
>>> Hi Guilherme.
>>> 
>>> I am sending the analysis result and the json result as requested.
>>> 
>>> 
>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>> though).
>>> 
>>> From the analysis screen, the analysis is working as expected. One of the
>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>> think of is: the stopword "a" is probably present in post-analysis either
>>> of query or index. Did you tweak your index time analysis after indexing?
>>> 
>>> Do two things:
>>> 
>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>> providing the link here.
>>> 2. Give the same JSON output as you have sent but this time with
>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>> 
>>> 
>>> 
>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <[email protected]> wrote:
>>> 
>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>>> Apache server is fairly aggressive about stripping attachments though, so
>>>> it’s also possible they didn’t make it through.
>>>> 
>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[email protected]> wrote:
>>>>> 
>>>>> Thanks Erick.
>>>>> 
>>>>>> First, your index and analysis chains are considerably different, this
>>>> can easily be a source of problems. In particular, using two different
>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>> of the length filter is suspicious, especially since your problem statement
>>>> is about the addition of a single letter term and the min length allowed on
>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>> interactions.
>>>>> I will investigate the min length and post the results later.
>>>>> 
>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>> Used by custom code?
>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>> 
>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>> typo.
>>>>> This is part of the application. Species will be used later on in Solr
>>>>> to filter the results. Those are not Solr params; they are my app's params.
>>>>> 
>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>> that part.
>>>>> The two JSON files I sent were produced with debugQuery=on, and the
>>>>> explain tag is present.
>>>>> I will try searching the way you mentioned.
>>>>> 
>>>>> Thanks for your input
>>>>> 
>>>>> Guilherme
>>>>> 
>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <[email protected]>
>>>> wrote:
>>>>>> 
>>>>>> Fwd to another server
>>>>>> 
>>>>>> First, your index and analysis chains are considerably different, this
>>>> can easily be a source of problems. In particular, using two different
>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>> of the length filter is suspicious, especially since your problem statement
>>>> is about the addition of a single letter term and the min length allowed on
>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>> interactions.
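
The supposition above, that 'a' is filtered out on both sides, can be
checked roughly with a small Python simulation of the two analyzers in the
posted schema. This is only a sketch under several assumptions:
StandardTokenizer is approximated by a regex over letters and digits,
ClassicFilter is left out on the assumption that it does not change these
particular tokens, the stopword set holds only the entries visible in the
post (the rest is elided there), and token positions are not modelled.

```python
import re

# Partial stopword list: only the entries visible in the posted stopwords.txt.
STOPWORDS = {"a", "b", "c", "an", "and", "are"}


def common_filters(tokens):
    """LengthFilter(min=2, max=20), then LowerCaseFilter, then StopFilter."""
    kept = [t for t in tokens if 2 <= len(t) <= 20]
    lowered = [t.lower() for t in kept]
    return [t for t in lowered if t not in STOPWORDS]


def index_chain(text):
    # Assumption: a rough stand-in for StandardTokenizer (runs of letters/digits).
    return common_filters(re.findall(r"[A-Za-z0-9]+", text))


def query_chain(text):
    # PatternTokenizer: split on anything outside [a-zA-Z0-9/._:].
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # The three PatternReplace filters from the query analyzer.
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [t.replace("_", " ") for t in tokens]
    return common_filters(tokens)


indexed = index_chain(
    "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"
)
with_a = query_chain("lymphoid and a non-lymphoid cell")
without_a = query_chain("lymphoid and non-lymphoid cell")

print(indexed)
print(with_a)
print(without_a)
```

Under these assumptions both queries reduce to the same terms, so
term-level analysis alone does not separate them. Position gaps left by the
filters and the behaviour of the query parser are outside this sketch.
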
>>>>>> 
>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>> Used by custom code?
>>>>>> 
>>>>>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>> 
>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>> typo.
>>>>>> 
>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>> that part.
>>>>>> 
>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>> expect” is answered by looking at the “&debug=query” output and the
>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>>> at _both_ the query and index output. Also, and very important about the
>>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>>> put in the text boxes have made it through the query parser intact and is
>>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>>> helps.
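
As a minimal sketch of the advice above, the following Python builds a
Solr request URL with debug=query, plus echoParams=all, which is requested
elsewhere in this thread. The host, port, core name ("mycore") and handler
path are placeholders, not the real Reactome setup, and the function name
is made up for illustration.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, handler, query):
    """Build a Solr URL that returns the parsed query and all request params."""
    params = {
        "q": query,
        "debug": "query",      # parsed query only, no relevance explanation
        "echoParams": "all",   # echo handler defaults along with request params
        "wt": "json",
    }
    return f"{base_url}{handler}?{urlencode(params)}"


# Hypothetical local core; swap in the real base URL and handler.
url = debug_query_url(
    "http://localhost:8983/solr/mycore",
    "/select",
    "lymphoid and a non-lymphoid cell",
)
print(url)
```

Building the same URL once per handler makes it easy to compare the
parsed query of the default handler against that of a custom one.
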
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <[email protected]>
>>>> wrote:
>>>>>>> 
>>>>>>> Hi Walter,
>>>>>>> 
>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>> will
>>>>>>>> not be in the index, so they can never match a query.
>>>>>>> 
>>>>>>> 
>>>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>>>> think he's using the filter factory correctly - the query chain
>>>> includes
>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>> 
>>>>>>> *@Guilherme*, please post results for both the query, the document in
>>>>>>> result you are concerned about and post full result of analysis screen
>>>> (for
>>>>>>> both query and index).
>>>>>>> 
>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <[email protected]>
>>>> wrote:
>>>>>>> 
>>>>>>>> No.
>>>>>>>> 
>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>> 
>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>> schema.xml.
>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>> config.
>>>>>>>> 3. Reindex all of the documents.
>>>>>>>> 
>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>> removed and they will be searchable.
>>>>>>>> 
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> [email protected]
>>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>>> 
>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <[email protected]>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>> OK. I am kind of lost now.
>>>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>>> result.
>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>> 
>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>>>> then
>>>>>>>> add to solr. Is that correct ?
>>>>>>>>> 
>>>>>>>>> Thanks David
>>>>>>>>> 
>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>> [email protected]
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Fwd to another server
>>>>>>>>>> 
>>>>>>>>>> no,
>>>>>>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>> 
>>>>>>>>>> is still using stopwords and should be removed, in my opinion of
>>>> course,
>>>>>>>>>> based on your use case may be different, but i generally axe any
>>>>>>>> reference
>>>>>>>>>> to them at all
>>>>>>>>>> 
>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <[email protected]
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>     <analyzer type="index">
>>>>>>>>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>         <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>         <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>> max="20"/>
>>>>>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>     </analyzer>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>> [email protected]
>>>>>>>> <mailto:[email protected]>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>> 
>>>>>>>>>>>> The first thing you should do is remove any reference to stop
>>>> words
>>>>>>>> and
>>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>> [email protected]
>>>>>>>> <mailto:[email protected]>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am performing a search to match a name (text_field), however
>>>> this
>>>>>>>> term
>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i
>>>> remove
>>>>>>>>>>> 'a'
>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>> e.g
>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>> works:
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>> 
>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>> <field name="name"                          type="text_field"
>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>>>> required="true"
>>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>>> 
>>>>>>>>>>>>>     <analyzer type="query">
>>>>>>>>>>>>>         <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>         <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>         <filter class="solr.StopFilterFactory"
>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>     </analyzer>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>     <analyzer type="index">
>>>>>>>>>>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>         <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>         <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>         <filter class="solr.StopFilterFactory"
>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>     </analyzer>
>>>>>>>>>>>>>     <analyzer type="query">
>>>>>>>>>>>>>         <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>         <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>         <filter class="solr.StopFilterFactory"
>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>     </analyzer>
>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>> a
>>>>>>>>>>>>> b
>>>>>>>>>>>>> c
>>>>>>>>>>>>> ....
>>>>>>>>>>>>> an
>>>>>>>>>>>>> and
>>>>>>>>>>>>> are
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Guilherme
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> --
>>>>>>> Regards,
>>>>>>> 
>>>>>>> *Paras Lehana* [65871]
>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>> 
>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>> Noida, UP, IN - 201303
>>>>>>> 
>>>>>>> Mob.: +91-9560911996
>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>> 
>>>>>>> --
>>>>>>> IMPORTANT:
>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>
