Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri Wed, 20 Nov 2019 09:06:51 -0800

Hi,

Alright, after much trial and error, I have managed to isolate the fields that are 
causing the search to fail.
All the fields of type <fieldType name="id" class="solr.StrField"/> are 
breaking my search.


I changed the id StrField type to:
        <fieldType name="id" class="solr.TextField">
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            </analyzer>
        </fieldType>

And now it finally works. However, I am worried that this is incorrect or bad 
practice, as I am dealing with IDs and they should not be parsed at all.
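
One possible way to reconcile the two concerns, as a minimal sketch that has not been tested against this index: keep the key field as a StrField and search a copy of it whose type uses KeywordTokenizerFactory, so each ID stays a single token while stopwords are still dropped at query time. The names id_search and dbId_search are hypothetical; stopwords.txt is the file already used above. The sketch assumes each query term reaches the field's analyzer on its own (split-on-whitespace behaviour), which depends on the Solr version and parser settings.

```xml
<!-- Hypothetical type: the whole ID is kept as one token (KeywordTokenizerFactory), -->
<!-- and a token that is a stopword is dropped at query time only.                   -->
<fieldType name="id_search" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
</fieldType>

<!-- dbId keeps its StrField type; dbId_search (hypothetical name) is the copy used in qf. -->
<field name="dbId_search" type="id_search" indexed="true" stored="false"/>
<copyField source="dbId" dest="dbId_search"/>
```

With this layout the key itself is never analyzed, and qf would point at dbId_search instead of dbId.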

What is your opinion?

Thanks
Guilherme

> On 18 Nov 2019, at 15:42, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> 
> Hi,
> 
>> Have you tried reindexing the documents and compare the results? No issues
>> if you cannot do that - let's try something else. I was going through the
>> whole mail and your files. You had said:
> Yes, but since it hasn't worked as suggested, I kept as you suggested.
> 
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>>> don't get anything (which make sense).
>> 
>> Why did you think that not getting anything when you add dbId made sense?
>> Asking because I may be missing something here.
> I was searching for text on an ID field, which wouldn't 
> make sense.
> (I will come back to this soon.)
> 
> OK, I've been adding and removing fields in the qf and I could isolate half 
> of the problem. First, I have one type of field called keyword_field; I 
> added the stopwords filter to this field and it worked. Second,
> when I add the fields of type id (<fieldType name="id" class="solr.StrField"/>),
> the search fails.
> 
> Do you think I should also add the stopwords filter to the fieldType id?
> (I tried, and it worked, but I am not sure if this is conceptually correct; 
> an id should remain intact, as I understand it.)
> 
> Thanks
> Guilherme
> 
>> On 18 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com 
>> <mailto:paras.leh...@indiamart.com>> wrote:
>> 
>> Hi Guilherme,
>> 
>> Have you tried reindexing the documents and compare the results? No issues
>> if you cannot do that - let's try something else. I was going through the
>> whole mail and your files. You had said:
>> 
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>>> don't get anything (which make sense).
>> 
>> 
>> Why did you think that not getting anything when you add dbId made sense?
>> Asking because I may be missing something here.
>> 
>> Also, what is the purpose of so many qf's? Going through your documents and
>> config files, I found that your dbId's are string of numbers and I don't
>> think you want to find your query terms in dbId, right?
>> Do you want to boost the score by the values in dbId?
>> 
>> Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
>> your terms don't match with the values in dbId for any document, the score
>> produced by this scoring is 0. 100x or 1x of 0 is still 0.
>> I still need to see how this scoring gets added up in edismax parser but do
>> reevaluate the usage of these qfs. Same goes for other qf boosts. :)
>> 
>> 
>> On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri <gvit...@ebi.ac.uk 
>> <mailto:gvit...@ebi.ac.uk>> wrote:
>> 
>>> Hi Paras
>>> No worries.
>>> No I didn’t find anything. This is annoying now...
>>> Yes! They do contain dbId. Absolutely all my docs contains dbId and it is
>>> actually my key, if you check again the schema.xml
>>> 
>>> Cheers
>>> Guilherme
>>> 
>>> On 15 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com 
>>> <mailto:paras.leh...@indiamart.com>> wrote:
>>> 
>>> 
>>> Hey Guilherme,
>>> 
>>> I was a bit busy for the past few days and couldn't read your mail. So,
>>> did you find anything? Anyways, as I had expected, the culprit is
>>> definitely among the qfs. Do the documents in concern contain dbId? I
>>> suggest you to cross check the fields in your document with those impacting
>>> the result in qf.
>>> 
>>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <gvit...@ebi.ac.uk 
>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>> 
>>>> What I can't understand is:
>>>> if I search for the exact term "Immunoregulatory interactions between a
>>>> Lymphoid *and a* non-Lymphoid cell", it fails, and if I search for
>>>> "Immunoregulatory interactions between a Lymphoid *and* non-Lymphoid
>>>> cell", then it works.
>>>> 
>>>> On 11 Nov 2019, at 12:24, Guilherme Viteri <gvit...@ebi.ac.uk 
>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Thanks
>>>> 
>>>> Removing stopwords is another story. I'm curious to find the reason
>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>> really necessary.
>>>> 
>>>> Yes. It has always made sense the way we've been using them.
>>>> 
>>>> If q.alt is giving you responses, it's confirmed that your stopwords
>>>> filter is working as expected. The problem definitely lies in the
>>>> configuration of edismax.
>>>> 
>>>> I see.
>>>> 
>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>>> 
>>>> OK, using q now, I removed all qf, performed the search and got 23
>>>> results, with the one I really want at the top.
>>>> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I
>>>> don't get anything (which makes sense). However, if I query name_exact, I
>>>> get the 23 results again, and unfortunately if I query stId^1.0
>>>> name_exact^10.0 I still don't get any results.
>>>> 
>>>> In summary
>>>> - without qf - 23 results
>>>> - dbId - 0 results
>>>> - name_exact - 16 results
>>>> - name - 23 results
>>>> - dbId^1.0 name_exact^10.0 - 0 results
>>>> - 0 results if any other field (stId, or dbId, the key) is added on top of
>>>> the name fields (name_exact, etc.).
>>>> 
>>>> Definitely lost here! :-/
>>>> 
>>>> 
>>>> On 11 Nov 2019, at 07:59, Paras Lehana <paras.leh...@indiamart.com 
>>>> <mailto:paras.leh...@indiamart.com>>
>>>> wrote:
>>>> 
>>>> Hi
>>>> 
>>>> So I don't think removing it completely is the way to go from the scenario
>>>> we have
>>>> 
>>>> Removing stopwords is another story. I'm curious to find the reason
>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>> really necessary.
>>>> 
>>>> Quite a considerable increase
>>>> 
>>>> If q.alt is giving you responses, it's confirmed that your stopwords
>>>> filter is working as expected. The problem definitely lies in the
>>>> configuration of edismax.
>>>> 
>>>> I am sorry but I didn't understand what do you want me to do exactly with
>>>> the lst (??) and qf and bf.
>>>> 
>>>> What combinations did you try? I was referring to the field-level boosting
>>>> you have applied in edismax config.
>>>> 
>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>>> request handler. There are many qf and some bq boosts. I want you to
>>>> remove all of these, check the response again (with q now) and keep on
>>>> adding them again (one by one) while looking for when the numFound
>>>> drastically changes.
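
The bisection described above could start from a handler like the following minimal sketch. The handler name /search, the echoParams setting and the field names (name, name_exact, dbId) come from this thread; the rest of the handler body is an assumption, since the real solrconfig.xml is not shown here.

```xml
<!-- Sketch only: a stripped-down /search handler for bisecting qf. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Step 1: a single text field; note numFound in the response. -->
    <str name="qf">name</str>
    <!-- Step 2: change qf to "name name_exact^10.0" and compare numFound. -->
    <!-- Step 3: change qf to "name name_exact^10.0 dbId^1.0" and compare again. -->
    <str name="echoParams">all</str>
  </lst>
</requestHandler>
```

The field whose addition makes numFound collapse is the one whose analysis differs from the others.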
>>>> 
>>>> On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com 
>>>> <mailto:hastings.recurs...@gmail.com>> wrote:
>>>> 
>>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>>> pretty well for such a solution, but for a full index the size became
>>>> prohibitive
>>>> 
>>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org 
>>>> <mailto:wun...@wunderwood.org>> wrote:
>>>> 
>>>> If we had IDF for phrases, they would be super effective. The 2X weight
>>>> is a hack that mostly works.
>>>> 
>>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>>> hastings.recurs...@gmail.com> wrote:
>>>> 
>>>> the pf and qf fields are REALLY nice for this
>>>> 
>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>>>> wun...@wunderwood.org> wrote:
>>>> 
>>>> I always enable phrase searching in edismax for exactly this reason.
>>>> 
>>>> Something like:
>>>> 
>>>>    <str name="qf">title^8 keywords^4 text</str>
>>>>    <str name="pf">title^16 keywords^8 text^2</str>
>>>> 
>>>> To deal with concepts in queries, a classifier and/or named entity
>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>>>> that term can be queried against the field matching that vocabulary.
>>>> 
>>>> This is how LinkedIn separates people, companies, and places, for
>>>> example.
>>>> 
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com>
>>>> wrote:
>>>> 
>>>> Look at the “mm” parameter, try setting it to 100%. Although that’s
>>>> not entirely likely to do what you want either since virtually every doc
>>>> will have “a” in it. But at least you’d get docs that have both terms.
>>>> 
>>>> You may also be able to search for things like “Lamin A” _only as a
>>>> phrase_ and have some luck. But this is a gnarly problem in general.
>>>> Some people have been able to substitute synonyms and/or shingles to make
>>>> this work at the expense of a larger index.
>>>> 
>>>> This is a generic problem with context. “Lamin A” is really a
>>>> “concept”, not just two words that happen to be near each other.
>>>> Searching as a phrase is an OOB-but-naive way to try to make it more
>>>> likely that the ranked results refer to the _concept_ of “Lamin A”. The
>>>> assumption here is “if these two words appear next to each other, they’re
>>>> more likely to be what I want”. I say “naive” because “Lamins: A new
>>>> approach to...” would _also_ be found for a naive phrase search. (I have
>>>> no idea whether such a title makes sense or not, but you figured that out
>>>> already)...
>>>> 
>>>> To do this well you’d have to dive in to NLP/Machine learning.
>>>> 
>>>> I truly wish we could have the DWIM search algorithm (Do What I
>>>> Mean)….
>>>> 
>>>> 
>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>>>> wrote:
>>>> 
>>>> Hi Walter and Paras
>>>> 
>>>> I indexed it after removing all the references to StopWordFilter and I
>>>> went from 121 results to nearly 20K, as the search term q="Lymphoid and a
>>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A".
>>>> So I don't think removing it completely is the way to go from the
>>>> scenario we have, but I appreciate the suggestion…
>>>> 
>>>> Yes the response is using fl=*
>>>> I am trying some combinations at the moment, but no success yet.
>>>> 
>>>> defType=edismax
>>>> q.alt=Lymphoid and a non-Lymphoid cell
>>>> Number of results=1599
>>>> Quite a considerable increase, even though reasonably meaningful
>>>> results.
>>>> 
>>>> I am sorry but I didn't understand what do you want me to do exactly
>>>> with the lst (??) and qf and bf.
>>>> 
>>>> Thanks everyone for your inputs
>>>> 
>>>> 
>>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com>
>>>> wrote:
>>>> 
>>>> Hi Guilherme
>>>> 
>>>> By accident, I ended up querying using the default handler
>>>> (/select) and it worked.
>>>> 
>>>> You've just found the culprit. Thanks for giving the material I
>>>> requested. Your analysis chain is working as expected. I don't see any
>>>> issue in either StopWordFilter or your boosts. I also use a boost of
>>>> 50 when boosting contextual suggestions (boosting "gold iphone" on a page
>>>> of iphone) but I take Walter's suggestion and would try to optimize my
>>>> weights. I agree that this 50 thing was not researched much by us
>>>> either (we never faced performance or relevance issues).
>>>> 
>>>> See the major difference in both the handlers - edismax. I'm pretty
>>>> sure that your problem lies in the parsing of queries (you can confirm
>>>> that from the parsedquery key in debug of both JSON responses). I hope
>>>> you have provided the response with fl=*. Replace q with q.alt in your
>>>> /search handler query and I think you should start getting responses.
>>>> That's because q.alt uses standard parser. If you want to keep using
>>>> edisMax, I suggest you test the responses removing some combination of
>>>> lst (qf, bf) and find what's restricting the documents from coming up.
>>>> I'm out of office today - would have certainly tried analyzing the field
>>>> values of the document in /select request and compare it with qf/bq in
>>>> solrconfig.xml /search. Do this for me and you'd certainly find
>>>> something.
>>>> 
>>>> 
>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <wun...@wunderwood.org
>>>> <mailto:wun...@wunderwood.org>> wrote:
>>>> 
>>>> I normally use a weight of 8 for the most important field, like
>>>> title. Other fields might get a 4 or 2.
>>>> 
>>>> I add a “pf” field with the weights doubled, so that phrase matches
>>>> have a higher weight.
>>>> 
>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>>> early web search engines. With different relevance algorithms and
>>>> totally different evaluation and tuning systems, they settled on weights
>>>> of 8 and 7.5 for HTML titles. With the two radically different systems
>>>> getting the same number, I decided that was a property of the documents,
>>>> not of the search engines.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>>> (my blog)
>>>> 
>>>> 
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Hi Wunder,
>>>> 
>>>> My indexer takes quite a few hours to run. I am shortening it
>>>> to run faster, but I also need to make sure it gives what we are
>>>> expecting. This implementation has been there for >4y, and is heavily
>>>> used.
>>>> 
>>>> In your edismax handlers, weights of 20, 50, and 100 are
>>>> extremely high. I don’t think I’ve ever used a weight higher than 16 in a
>>>> dozen years of configuring Solr.
>>>> 
>>>> I've inherited that implementation and I am really keen to
>>>> adjust it; what would you recommend?
>>>> 
>>>> Cheers
>>>> Guilherme
>>>> 
>>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org
>>>> <mailto:wun...@wunderwood.org>> wrote:
>>>> 
>>>> Thanks for posting the files. Looking at schema.xml, I see that
>>>> you still are using StopFilterFactory. The first advice we gave you was to
>>>> remove that.
>>>> 
>>>> Remove StopFilterFactory everywhere and reindex.
>>>> 
>>>> You will continue to have problems matching stopwords until you
>>>> do that.
>>>> 
>>>> In your edismax handlers, weights of 20, 50, and 100 are
>>>> extremely high. I don’t think I’ve ever used a weight higher than 16 in a
>>>> dozen years of configuring Solr.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>>> (my blog)
>>>> 
>>>> 
>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Hi Paras, everyone
>>>> 
>>>> Thank you again for your inputs and suggestions. I am sorry to hear
>>>> you had trouble with the attachments. I will host them somewhere and
>>>> share the links.
>>>> I don't tweak my index; I get the data from the graph database,
>>>> create the documents as they are and save them to Solr.
>>>> 
>>>> So, I am sending the new analysis screen querying the way you
>>>> suggested. Also the results with params and Solr query URL.
>>>> 
>>>> During the process of querying what you asked I found something
>>>> really weird (at least for me). By accident, I ended up querying
>>>> using the default handler (/select) and it worked. Then if I use the one
>>>> I must use, it sadly doesn't work. I am posting both results and I will
>>>> also post the handlers as well.
>>>> 
>>>> Here is the link with all the files mentioned before
>>>> 
>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>> 
>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>> 
>>>> Thanks
>>>> 
>>>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com
>>>> <mailto:paras.leh...@indiamart.com>> wrote:
>>>> 
>>>> Hi Guilherme.
>>>> 
>>>> I am sending the analysis result and the json result as
>>>> requested.
>>>> 
>>>> Thanks for the effort. Luckily, I can see your attachments (low
>>>> quality though).
>>>> 
>>>> From the analysis screen, the analysis is working as expected.
>>>> One of the reasons for query="lymphoid and *a* non-lymphoid cell" not
>>>> matching the document containing "Lymphoid and a non-Lymphoid cell" I can
>>>> initially think of is: the stopword "a" is probably present post-analysis
>>>> in either the query or the index. Did you tweak your index time analysis
>>>> after indexing?
>>>> 
>>>> Do two things:
>>>> 
>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>> providing the link here.
>>>> 2. Give the same JSON output as you have sent but this time
>>>> with *"echoParams=all"*. Also, post the exact Solr query url.
>>>> 
>>>> 
>>>> 
>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
>>>> 
>>>> I don’t see the attachments, maybe I deleted old e-mails or
>>>> some such. The Apache server is fairly aggressive about stripping
>>>> attachments though, so it’s also possible they didn’t make it through.
>>>> 
>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Thanks Erick.
>>>> 
>>>> First, your index and analysis chains are considerably
>>>> different, this can easily be a source of problems. In particular, using
>>>> two different tokenizers is a huge red flag. I _strongly_ recommend
>>>> against this unless you’re totally sure you understand the consequences.
>>>> Additionally, your use of the length filter is suspicious, especially
>>>> since your problem statement is about the addition of a single letter
>>>> term and the min length allowed on that filter is 2. That said, it’s
>>>> reasonable to suppose that the ’a’ is filtered out in both cases, but
>>>> maybe you’ve found something odd about the interactions.
>>>> 
>>>> I will investigate the min length and post the results later.
>>>> 
>>>> Second, I have no idea what this will do. Are the equal
>>>> signs typos? Used by custom code?
>>>> 
>>>> This is the URL in my application, not Solr params. That's the
>>>> query string.
>>>> 
>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>>> likely that all the params with an equal-sign are totally ignored unless
>>>> it’s just a typo.
>>>> 
>>>> This is part of the application. Species will be used later
>>>> on in Solr to filter the results. That's not Solr; those are my app
>>>> params.
>>>> 
>>>> Third, the easiest way to see what’s happening under the
>>>> covers is to add “&debug=true” to the query and look at the parsed query.
>>>> Ignore all the relevance calculations for the nonce, or specify
>>>> “&debug=query” to skip that part.
>>>> 
>>>> The two JSON files I've sent have debugQuery=on, and the
>>>> explain tag is present.
>>>> I will try searching the way you mentioned.
>>>> 
>>>> Thanks for your inputs
>>>> 
>>>> Guilherme
>>>> 
>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> First, your index and analysis chains are considerably
>>>> different, this can easily be a source of problems. In particular, using
>>>> two different tokenizers is a huge red flag. I _strongly_ recommend
>>>> against this unless you’re totally sure you understand the consequences.
>>>> Additionally, your use of the length filter is suspicious, especially
>>>> since your problem statement is about the addition of a single letter
>>>> term and the min length allowed on that filter is 2. That said, it’s
>>>> reasonable to suppose that the ’a’ is filtered out in both cases, but
>>>> maybe you’ve found something odd about the interactions.
>>>> 
>>>> Second, I have no idea what this will do. Are the equal
>>>> signs typos? Used by custom code?
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>>> likely that all the params with an equal-sign are totally ignored unless
>>>> it’s just a typo.
>>>> 
>>>> Third, the easiest way to see what’s happening under the
>>>> covers is to add “&debug=true” to the query and look at the parsed query.
>>>> Ignore all the relevance calculations for the nonce, or specify
>>>> “&debug=query” to skip that part.
>>>> 
>>>> 90% + of the time, the question “why didn’t this query do
>>>> what I expect” is answered by looking at the “&debug=query” output
>>>> and the analysis page in the admin UI. NOTE: for the analysis page be
>>>> sure to look at _both_ the query and index output. Also, and very
>>>> important about the analysis page (and this is confusing) is that this
>>>> _assumes_ that what you put in the text boxes have made it through the
>>>> query parser intact and is analyzed by the field selected. Consider the
>>>> search "q=field:word1 word2". Now you type “word1 word2” into the
>>>> analysis text box and it looks like what you expect. That’s misleading
>>>> because the query is _parsed_ as "field:word1
>>>> default_search_field:word2". This is where “&debug=query” helps.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>>> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>> wrote:
>>>> 
>>>> Hi Walter,
>>>> 
>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>> Those words will not be in the index, so they can never match a query.
>>>> 
>>>> I think the OP's concern is different results when adding a
>>>> stopword. I think he's using the filter factory correctly - the query
>>>> chain includes the filter as well so it should remove "a" while querying.
>>>> 
>>>> *@Guilherme*, please post results for both the queries, the
>>>> document in the result you are concerned about and the full result of the
>>>> analysis screen (for both query and index).
>>>> 
>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>> wrote:
>>>> 
>>>> No.
>>>> 
>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>> Those words will not be in the index, so they can never match a query.
>>>> 
>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>> schema.xml.
>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>> config.
>>>> 3. Reindex all of the documents.
>>>> 
>>>> When indexed with the new analysis chain, the stopwords will not be
>>>> removed and they will be searchable.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>>> 
>>>> 
>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> OK. I am kind of lost now.
>>>> If I open up the console > analysis and perform it, that's the final
>>>> result.
>>>> 
>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>> 
>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>>>> then add to solr. Is that correct ?
>>>> 
>>>> Thanks David
>>>> 
>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>> hastings.recurs...@gmail.com <mailto:hastings.recurs...@gmail.com>> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> no,
>>>>  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>> 
>>>> is still using stopwords and should be removed, in my opinion of
>>>> course, based on your use case may be different, but i generally axe any
>>>> reference to them at all
>>>> 
>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Thanks.
>>>> Haven't I done this here ?
>>>> <fieldType name="text_field" class="solr.TextField"
>>>> positionIncrementGap="100" omitNorms="false" >
>>>> <analyzer type="index">
>>>>  <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>  <filter class="solr.ClassicFilterFactory"/>
>>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>>  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>> </analyzer>
>>>> 
>>>> 
>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>> hastings.recurs...@gmail.com <mailto:hastings.recurs...@gmail.com>> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> The first thing you should do is remove any reference to stop words
>>>> and never use them, then re-index your data and try it again.
>>>> 
>>>> 
>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>> 
>>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
>>>> 
>>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
>>>> 
>>>> wrote:
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> I am performing a search to match a name
>>>> 
>>>> (text_field),
>>>> 
>>>> however
>>>> 
>>>> this
>>>> 
>>>> term
>>>> 
>>>> contains 'and' and 'a' and it doesn't return any
>>>> 
>>>> records. If i
>>>> 
>>>> remove
>>>> 
>>>> 'a'
>>>> 
>>>> then it works.
>>>> e.g
>>>> Search Term: lymphoid and a non-lymphoid cell
>>>> doesn't work:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> <
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> 
>>>> <
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> <
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> 
>>>> 
>>>> <
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> <
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Search term: lymphoid and non-lymphoid cell
>>>> works:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> 
>>>> I am interested in the first result.
>>>> 
>>>> schema.xml
>>>> <field name="name" type="text_field" indexed="true" stored="true"
>>>> omitNorms="false" required="true" multiValued="false"/>
>>>> 
>>>> <analyzer type="query">
>>>>  <tokenizer class="solr.PatternTokenizerFactory"
>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>  <filter class="solr.PatternReplaceFilterFactory"
>>>> pattern="^[/._:]+" replacement=""/>
>>>>  <filter class="solr.PatternReplaceFilterFactory"
>>>> pattern="[/._:]+$" replacement=""/>
>>>>  <filter class="solr.PatternReplaceFilterFactory"
>>>> pattern="[_]" replacement=" "/>
>>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>>  <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt"/>
>>>> </analyzer>
>>>> 
>>>> <fieldType name="text_field" class="solr.TextField"
>>>> positionIncrementGap="100" omitNorms="false" >
>>>> <analyzer type="index">
>>>>  <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>  <filter class="solr.ClassicFilterFactory"/>
>>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>>  <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt"/>
>>>> </analyzer>
>>>> <analyzer type="query">
>>>>  <tokenizer class="solr.PatternTokenizerFactory"
>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>  <filter class="solr.PatternReplaceFilterFactory"
>>>> pattern="^[/._:]+" replacement=""/>
>>>>  <filter class="solr.PatternReplaceFilterFactory"
>>>> pattern="[/._:]+$" replacement=""/>
>>>>  <filter class="solr.PatternReplaceFilterFactory"
>>>> pattern="[_]" replacement=" "/>
>>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>>  <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt"/>
>>>> </analyzer>
>>>> </fieldType>
>>>> 
>>>> stopwords.txt
>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>> 
>>>> a
>>>> b
>>>> c
>>>> ....
>>>> an
>>>> and
>>>> are
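
For reference, here is a rough sketch of what the query analyzer quoted above would do to the two search terms. It is only an approximation in Python, not Solr itself: the function name analyze_query is made up for illustration, Python's re module is assumed to behave like Java regex for these simple patterns, and the stopword set holds only the entries visible in the quoted stopwords.txt (the "...." part is left out).

```python
import re

# Only the entries visible in the quoted stopwords.txt; the elided part is not guessed.
STOPWORDS = {"a", "b", "c", "an", "and", "are"}


def analyze_query(text):
    """Approximate the quoted query-time analyzer chain of text_field."""
    # PatternTokenizerFactory: split on anything outside [a-zA-Z0-9/._:],
    # dropping empty tokens.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    result = []
    for tok in tokens:
        # The three PatternReplaceFilterFactory steps, in schema order.
        tok = re.sub(r"^[/._:]+", "", tok)
        tok = re.sub(r"[/._:]+$", "", tok)
        tok = re.sub(r"[_]", " ", tok)
        # LengthFilterFactory min="2" max="20"
        if not 2 <= len(tok) <= 20:
            continue
        # LowerCaseFilterFactory
        tok = tok.lower()
        # StopFilterFactory
        if tok in STOPWORDS:
            continue
        result.append(tok)
    return result


print(analyze_query("lymphoid and a non-lymphoid cell"))
print(analyze_query("lymphoid and non-lymphoid cell"))
```

Under these assumptions both search terms reduce to the same tokens for text_field ('a' is already dropped by the length filter, 'and' by the stop filter), which would be consistent with the difference coming from the other fields in qf rather than from this field type.
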
>>>> 
>>>> Running Solr 6.6.2.
>>>> 
>>>> Is there anything I could do to prevent this?
>>>> 
>>>> Thanks
>>>> Guilherme
>>>> 
>>>> --
>>>> --
>>>> Regards,
>>>> 
>>>> *Paras Lehana* [65871]
>>>> Development Engineer, Auto-Suggest,
>>>> IndiaMART Intermesh Ltd.
>>>> 
>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>> Noida, UP, IN - 201303
>>>> 
>>>> Mob.: +91-9560911996
>>>> Work: 01203916600 | Extn:  *8173*
>>>> 
>>>> --
>>>> IMPORTANT:
>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>> 
>
