Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood Thu, 07 Nov 2019 06:44:54 -0800

Thanks for posting the files. Looking at schema.xml, I see that you are still 
using StopFilterFactory. The first piece of advice we gave you was to remove it.


Remove StopFilterFactory everywhere and reindex.

You will continue to have problems matching stopwords until you do that.
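
As a rough sketch of what that change could look like, here is the text_field type quoted further down in this thread with only the StopFilterFactory lines removed from both analyzers. Everything else is copied unchanged from the posted schema, so treat it as an illustration rather than a recommended final configuration. In particular, the LengthFilterFactory with min="2" is left in place and would still drop single-letter tokens such as "a".

```xml
<!-- Sketch only: the posted text_field type with the StopFilterFactory lines removed. -->
<fieldType name="text_field" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <!-- Left as posted; min="2" still drops one-letter tokens such as "a". -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After a change like this, the new config has to be loaded and every document reindexed, because the existing index entries were built with the old chain.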

In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
don’t think I’ve ever used a weight higher than 16 in a dozen years of 
configuring Solr.
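
For comparison, a handler with more modest boosts might look like the sketch below. The handler name /search and the description field are hypothetical (only name appears in the posted schema), and the values 8 and 2 are illustrative, not tuned for this index.

```xml
<!-- Hypothetical solrconfig.xml fragment; handler name, field list, and boosts are illustrative. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- "name" is from the posted schema; "description" is an assumed field. -->
    <str name="qf">name^8 description^2</str>
  </lst>
</requestHandler>
```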

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> 
> Hi Paras, everyone
> 
> Thank you again for your inputs and suggestions. I am sorry to hear you had 
> trouble with the attachments; I will host them somewhere and share the links. 
> I don't tweak my index. I get the data from the graph database, create the 
> documents as they are, and save them to Solr.
> 
> So, I am sending the new analysis screen, querying the way you suggested, along 
> with the results with params and the Solr query URL.
> 
> While running the queries you asked for, I found something really weird 
> (at least to me). By accident, I ended up querying using the default 
> handler (/select) and it worked. If I use the handler I must use, then sadly 
> it doesn't work. I am posting both results and I will also post the 
> handlers.
> 
> Here is the link with all the files mentioned before
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
> If the link doesn't work www dot dropbox dot com slash sh slash 
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> 
> Thanks
> 
>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> 
>> Hi Guilherme.
>> 
>> I am sending the analysis result and the json result as requested.
>> 
>> 
>> Thanks for the effort. Luckily, I can see your attachments (low quality
>> though).
>> 
>> From the analysis screen, the analysis is working as expected. One of the
>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>> think of is: the stopword "a" is probably present in post-analysis either
>> of query or index. Did you tweak your index time analysis after indexing?
>> 
>> Do two things:
>> 
>>  1. Post the analysis screen for index=*"Immunoregulatory
>>  interactions between a Lymphoid and a non-Lymphoid cell"* and
>>  query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>  providing the link here.
>>  2. Give the same JSON output as you have sent but this time with
>>  *"echoParams=all"*. Also, post the exact Solr query url.
>> 
>> 
>> 
>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>> Apache server is fairly aggressive about stripping attachments though, so
>>> it’s also possible they didn’t make it through.
>>> 
>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>> 
>>>> Thanks Erick.
>>>> 
>>>>> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
>>>> I will investigate the min length and post the results later.
>>>> 
>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
>>>> This is the URL in my application, not Solr params. That's the query string.
>>>> 
>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
>>>> This is part of the application. Species will be used later on in Solr
>>>> to filter the results. That's not Solr; those are my app params.
>>>> 
>>>>> Third, the easiest way to see what’s happening under the covers is to
>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>> that part.
>>>> The two JSON files I've sent were produced with debugQuery=on, and the
>>>> explain tag is present.
>>>> I will try searching the way you mentioned.
>>>> 
>>>> Thanks for your input
>>>> 
>>>> Guilherme
>>>> 
>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>>> 
>>>>> Fwd to another server
>>>>> 
>>>>> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
>>>>> 
>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
>>>>> 
>>>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>> 
>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
>>>>> 
>>>>> Third, the easiest way to see what’s happening under the covers is to
>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>> that part.
>>>>> 
>>>>> 90% + of the time, the question “why didn’t this query do what I
>>> expect” is answered by looking at the “&debug=query” output and the
>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>> at _both_ the query and index output. Also, and very important about the
>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>> put in the text boxes have made it through the query parser intact and is
>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>> Now you type “word1 word2” into the analysis text box and it looks like
>>> what you expect. That’s misleading because the query is _parsed_ as
>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>> helps.
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com>
>>> wrote:
>>>>>> 
>>>>>> Hi Walter,
>>>>>> 
>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will
>>>>>>> not be in the index, so they can never match a query.
>>>>>> 
>>>>>> 
>>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>>> think he's using the filter factory correctly - the query chain
>>> includes
>>>>>> the filter as well so it should remove "a" while querying.
>>>>>> 
>>>>>> *@Guilherme*, please post results for both the query, the document in
>>>>>> result you are concerned about and post full result of analysis screen
>>> (for
>>>>>> both query and index).
>>>>>> 
>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>>>>> 
>>>>>>> No.
>>>>>>> 
>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>> will not be in the index, so they can never match a query.
>>>>>>> 
>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>> schema.xml.
>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>> config.
>>>>>>> 3. Reindex all of the documents.
>>>>>>> 
>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>> removed and they will be searchable.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org
>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>> 
>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>>> wrote:
>>>>>>>> 
>>>>>>>> Ok. I am kind of lost now.
>>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>> result.
>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>> 
>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>>> then
>>>>>>> add to solr. Is that correct ?
>>>>>>>> 
>>>>>>>> Thanks David
>>>>>>>> 
>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>> hastings.recurs...@gmail.com
>>>>>>> <mailto:hastings.recurs...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Fwd to another server
>>>>>>>>> 
>>>>>>>>> no,
>>>>>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>> 
>>>>>>>>> is still using stopwords and should be removed, in my opinion of
>>> course,
>>>>>>>>> based on your use case may be different, but i generally axe any
>>>>>>> reference
>>>>>>>>> to them at all
>>>>>>>>> 
>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk
>>>>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks.
>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>      <analyzer type="index">
>>>>>>>>>>          <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>          <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
>>>>>>> max="20"/>
>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>      </analyzer>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>> hastings.recurs...@gmail.com
>>>>>>> <mailto:hastings.recurs...@gmail.com>>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Fwd to another server
>>>>>>>>>>> 
>>>>>>>>>>> The first thing you should do is remove any reference to stop
>>> words
>>>>>>> and
>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>> gvit...@ebi.ac.uk
>>>>>>> <mailto:gvit...@ebi.ac.uk>>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I am performing a search to match a name (text_field), however
>>> this
>>>>>>> term
>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i
>>> remove
>>>>>>>>>> 'a'
>>>>>>>>>>>> then it works.
>>>>>>>>>>>> e.g
>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>> <
>>>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>> 
>>>>>>>>>>>> <
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>> works:
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>> <
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>> 
>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>> 
>>>>>>>>>>>> schema.xml
>>>>>>>>>>>> <field name="name"                          type="text_field"
>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>>> required="true"
>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>> 
>>>>>>>>>>>>      <analyzer type="query">
>>>>>>>>>>>>          <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>> max="20"/>
>>>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>          <filter class="solr.StopFilterFactory"
>>>>>>> ignoreCase="true"
>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>      </analyzer>
>>>>>>>>>>>> 
>>>>>>>>>>>>  <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>      <analyzer type="index">
>>>>>>>>>>>>          <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>          <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>> max="20"/>
>>>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>          <filter class="solr.StopFilterFactory"
>>>>>>> ignoreCase="true"
>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>      </analyzer>
>>>>>>>>>>>>      <analyzer type="query">
>>>>>>>>>>>>          <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>> max="20"/>
>>>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>          <filter class="solr.StopFilterFactory"
>>>>>>> ignoreCase="true"
>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>      </analyzer>
>>>>>>>>>>>>  </fieldType>
>>>>>>>>>>>> 
>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>> a
>>>>>>>>>>>> b
>>>>>>>>>>>> c
>>>>>>>>>>>> ....
>>>>>>>>>>>> an
>>>>>>>>>>>> and
>>>>>>>>>>>> are
>>>>>>>>>>>> 
>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Guilherme
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> --
>>>>>> Regards,
>>>>>> 
>>>>>> *Paras Lehana* [65871]
>>>>>> Development Engineer, Auto-Suggest,
>>>>>> IndiaMART Intermesh Ltd.
>>>>>> 
>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>> Noida, UP, IN - 201303
>>>>>> 
>>>>>> Mob.: +91-9560911996
>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>> 
>>>>>> --
>>>>>> IMPORTANT:
>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>> 
>>>> 
>>> 
>>> 
>> 
>
