Hi,

Alright, after a lot of trial and error, I have managed to isolate the fields that are causing the search to fail: every field whose type is <fieldType name="id" class="solr.StrField"/> breaks my search.
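For context, StrField stores and matches values verbatim, with no analysis chain, so a query term such as "a" that the text fields drop as a stopword still reaches these fields unchanged. A minimal sketch of the declarations involved, assuming dbId and stId are declared roughly like this (only the field type and the fact that dbId is the key are confirmed in this thread; the other attributes are illustrative):

```xml
<!-- schema.xml (sketch): field attributes are assumptions, not copied from the real schema -->
<fieldType name="id" class="solr.StrField"/>

<!-- StrField values are indexed and matched as one verbatim token; no tokenizer or filter runs -->
<field name="dbId" type="id" indexed="true" stored="true" required="true"/>
<field name="stId" type="id" indexed="true" stored="true"/>

<uniqueKey>dbId</uniqueKey>
```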
I changed the id StrField to:

<fieldType name="id" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

And now it finally works. However, I am worried that this is incorrect or bad practice: I am dealing with IDs, and they should not be parsed in any way. What is your opinion?

Thanks
Guilherme

> On 18 Nov 2019, at 15:42, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
> Hi,
>
>> Have you tried reindexing the documents and compare the results? No issues
>> if you cannot do that - let's try something else. I was going through the
>> whole mail and your files. You had said:
> Yes, but since it hasn't worked as suggested, I kept it as you suggested.
>
>>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>>> don't get anything (which makes sense).
>>
>> Why did you think that not getting anything when you add dbId made sense?
>> Asking because I may be missing something here.
> I am searching for a text and I was searching on an ID field, which wouldn't
> make sense.
> (I will come back to this soon.)
>
> Ok, I've been adding and removing fields in the qf and I could isolate half
> of the problem. First, I have one type of field called keyword_field and I
> added the StopWords filter for this field and it worked. Second,
> when I add the fields that are id (<fieldType name="id" class="solr.StrField" />
>
> Do you think I should also add the stopwords filter for the fieldtype id?
> (I tried, and it worked, but I am not sure if this is conceptually correct;
> an id should remain intact, from my understanding.)
>
> Thanks
> Guilherme
>
>> On 18 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>> Hi Guilherme,
>>
>> Have you tried reindexing the documents and compare the results? No issues
>> if you cannot do that - let's try something else.
>> I was going through the whole mail and your files. You had said:
>>
>>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>>> don't get anything (which makes sense).
>>
>> Why did you think that not getting anything when you add dbId made sense?
>> Asking because I may be missing something here.
>>
>> Also, what is the purpose of so many qf's? Going through your documents and
>> config files, I found that your dbIds are strings of numbers and I don't
>> think you want to find your query terms in dbId, right?
>> Do you want to boost the score by the values in dbId?
>>
>> Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
>> your terms don't match the values in dbId for any document, the score
>> produced by this scoring is 0. 100x or 1x of 0 is still 0.
>> I still need to see how this scoring gets added up in the edismax parser, but do
>> reevaluate the usage of these qfs. Same goes for other qf boosts. :)
>>
>> On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>> Hi Paras
>>> No worries.
>>> No, I didn't find anything. This is annoying now...
>>> Yes! They do contain dbId. Absolutely all my docs contain dbId and it is
>>> actually my key, if you check the schema.xml again.
>>>
>>> Cheers
>>> Guilherme
>>>
>>> On 15 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>
>>> Hey Guilherme,
>>>
>>> I was a bit busy for the past few days and couldn't read your mail. So,
>>> did you find anything? Anyway, as I had expected, the culprit is
>>> definitely among the qfs. Do the documents in question contain dbId? I
>>> suggest you cross-check the fields in your document against those impacting
>>> the result in qf.
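A note on the qf advice above. A commonly reported edismax pitfall is that when some qf fields drop a stopword such as "a" and others (for example unanalyzed ID fields) keep it, the surviving clause can only be satisfied by the ID fields, and depending on mm nothing matches. One way to sidestep this is to keep identifier fields out of qf for free-text queries. A minimal sketch, assuming a /search handler like the one discussed in this thread; the boosts are placeholders and the rest of the real handler is omitted:

```xml
<!-- solrconfig.xml (sketch): only the parameters relevant to this thread are shown -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Free-text fields only; dbId and stId are deliberately left out of qf -->
    <!-- The boost values are placeholders, not a recommendation -->
    <str name="qf">name^8 name_exact^16</str>
  </lst>
</requestHandler>
```

With this layout an identifier lookup would have to be issued explicitly as a fielded query (for example dbId:12345, a hypothetical value) rather than relying on qf.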
>>>
>>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>
>>>> What I can't understand is:
>>>> I search for the exact term - "Immunoregulatory interactions between a
>>>> Lymphoid *and a* non-Lymphoid cell" and If i search "I search for the
>>>> exact term - Immunoregulatory interactions between a Lymphoid *and*
>>>> non-Lymphoid cell" then it works
>>>>
>>>> On 11 Nov 2019, at 12:24, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>
>>>> Thanks
>>>>
>>>>> Removing stopwords is another story. I'm curious to find the reason
>>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>>> really necessary.
>>>>
>>>> Yes. It has always made sense the way we've been using them.
>>>>
>>>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>>>> is working as expected. The problem definitely lies in the configuration of
>>>>> edismax.
>>>>
>>>> I see.
>>>>
>>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>>>
>>>> Ok, using q now, removed all qf, performed the search and I got 23
>>>> results, and the one I really want, on the top.
>>>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then
>>>> I don't get anything (which makes sense). However if I query name_exact, I
>>>> get the 23 results again, and unfortunately if I query stId^1.0
>>>> name_exact^10.0 I still don't get any results.
>>>>
>>>> In summary
>>>> - without qf - 23 results
>>>> - dbId - 0 results
>>>> - name_exact - 16 results
>>>> - name - 23 results
>>>> - dbId^1.0 name_exact^10.0 - 0 results
>>>> - 0 results if any other, stId, dbId (key) is added on top of the
>>>>   name (name_exact, etc).
>>>>
>>>> Definitely lost here!
:-/ >>>> >>>> >>>> On 11 Nov 2019, at 07:59, Paras Lehana <paras.leh...@indiamart.com >>>> <mailto:paras.leh...@indiamart.com>> >>>> wrote: >>>> >>>> Hi >>>> >>>> So I don't think removing it completely is the way to go from the scenario >>>> >>>> we have >>>> >>>> >>>> >>>> Removing stopwords is another story. I'm curious to find the reason >>>> assuming that you keep on using stopwords. In some cases, stopwords are >>>> really necessary. >>>> >>>> >>>> Quite a considerable increase >>>> >>>> >>>> If q.alt is giving you responses, it's confirmed that your stopwords >>>> filter >>>> is working as expected. The problem definitely lies in the configuration >>>> of >>>> edismax. >>>> >>>> >>>> >>>> I am sorry but I didn't understand what do you want me to do exactly with >>>> the lst (??) and qf and bf. >>>> >>>> >>>> >>>> What combinations did you try? I was referring to the field-level boosting >>>> you have applied in edismax config. >>>> >>>> *Let me explain again:* In your solrconfig.xml, look at your /search >>>> request handler. There are many qf and some bq boosts. I want you to >>>> remove >>>> all of these, check response again (with q now) and keep on adding them >>>> again (one by one) while looking for when the numFound drastically >>>> changes. >>>> >>>> On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com >>>> <mailto:hastings.recurs...@gmail.com> >>>>> >>>> wrote: >>>> >>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked >>>> pretty well for such a solution, but for a full index the size became >>>> prohibitive >>>> >>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org >>>> <mailto:wun...@wunderwood.org>> >>>> wrote: >>>> >>>> If we had IDF for phrases, they would be super effective. The 2X weight >>>> >>>> is >>>> >>>> a hack that mostly works. >>>> >>>> Infoseek had phrase IDF and it was a killer algorithm for relevance. 
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>
>>>> the pf and qf fields are REALLY nice for this
>>>>
>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <wun...@wunderwood.org> wrote:
>>>>
>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>
>>>> Something like:
>>>>
>>>> <str name="qf">title^8 keywords^4 text</str>
>>>> <str name="pf">title^16 keywords^8 text^2</str>
>>>>
>>>> To deal with concepts in queries, a classifier and/or named entity
>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
>>>> term can be queried against the field matching that vocabulary.
>>>>
>>>> This is how LinkedIn separates people, companies, and places, for example.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> Look at the “mm” parameter, try setting it to 100%. Although that's not
>>>> entirely likely to do what you want either since virtually every doc will
>>>> have “a” in it. But at least you’d get docs that have both terms.
>>>>
>>>> You may also be able to search for things like “Lamin A” _only as a
>>>> phrase_ and have some luck. But this is a gnarly problem in general. Some
>>>> people have been able to substitute synonyms and/or shingles to make this
>>>> work at the expense of a larger index.
>>>>
>>>> This is a generic problem with context.
“Lamin A” is really a >>>> >>>> “concept”, >>>> >>>> not just two words that happen to be near each other. Searching as a >>>> >>>> phrase >>>> >>>> is an OOB-but-naive way to try to make it more likely that the ranked >>>> results refer to the _concept_ of “Lamin A”. The assumption here is >>>> >>>> “if >>>> >>>> these two words appear next to each other, they’re more likely to be >>>> >>>> what I >>>> >>>> want”. I say “naive” because “Lamins: A new approach to...” would >>>> >>>> _also_ be >>>> >>>> found for a naive phrase search. (I have no idea whether such a title >>>> >>>> makes >>>> >>>> sense or not, but you figured that out already)... >>>> >>>> >>>> To do this well you’d have to dive in to NLP/Machine learning. >>>> >>>> I truly wish we could have the DWIM search algorithm (Do What I >>>> >>>> Mean)…. >>>> >>>> >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk> >>>> >>>> wrote: >>>> >>>> >>>> HI Walter and Paras >>>> >>>> I indexed it removing all the references to StopWordFilter and I >>>> >>>> went >>>> >>>> from 121 results to near 20K as the search term q="Lymphoid and a >>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A". >>>> >>>> So I >>>> >>>> don't think removing it completely is the way to go from the scenario >>>> >>>> we >>>> >>>> have, but I appreciate the suggestion… >>>> >>>> >>>> Yes the response is using fl=* >>>> I am trying some combinations at the moment, but yet no success. >>>> >>>> defType=edismax >>>> q.alt=Lymphoid and a non-Lymphoid cell >>>> Number of results=1599 >>>> Quite a considerable increase, even though reasonable meaningful >>>> >>>> results. >>>> >>>> >>>> I am sorry but I didn't understand what do you want me to do exactly >>>> >>>> with the lst (??) and qf and bf. 
>>>> >>>> >>>> Thanks everyone with their inputs >>>> >>>> >>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com> >>>> >>>> wrote: >>>> >>>> >>>> Hi Guilherme >>>> >>>> By accident, I ended up querying the using the default handler >>>> >>>> (/select) and it worked. >>>> >>>> >>>> You've just found the culprit. Thanks for giving the material I >>>> >>>> requested. Your analysis chain is working as expected. I don't see any >>>> issue in either StopWordFilter or your boosts. I also use a boost of >>>> >>>> 50 >>>> >>>> when boosting contextual suggestions (boosting "gold iphone" on a page >>>> >>>> of >>>> >>>> iphone) but I take Walter's suggestion and would try to optimize my >>>> weights. I agree that this 50 thing was not researched much about by >>>> >>>> us >>>> >>>> as >>>> >>>> well (we never faced performance or relevance issues). >>>> >>>> >>>> See the major difference in both the handlers - edismax. I'm pretty >>>> >>>> sure that your problem lies in the parsing of queries (you can confirm >>>> >>>> that >>>> >>>> from parsedquery key in debug of both JSON responses). I hope you have >>>> provided the response with fl=*. Replace q with q.alt in your /search >>>> handler query and I think you should start getting responses. That's >>>> because q.alt uses standard parser. If you want to keep using >>>> >>>> edisMax, I >>>> >>>> suggest you to test the responses removing some combination of lst >>>> >>>> (qf, >>>> >>>> bf) >>>> >>>> and find what's restricting the documents to come up. I'm out of >>>> >>>> office >>>> >>>> today - would have certainly tried analyzing the field values of the >>>> document in /select request and compare it with qf/bq in >>>> >>>> solrconfig.xml >>>> >>>> /search. Do this for me and you'd certainly find something. 
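The bisection suggested above can be done without touching the production handler. A sketch, assuming a scratch handler (the name /search_debug is made up for illustration) that starts from a single qf field; fields and boosts are then re-added one at a time while watching numFound in each response:

```xml
<!-- solrconfig.xml (sketch): a hypothetical scratch copy of /search used only for testing -->
<requestHandler name="/search_debug" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Echo every effective parameter so each test response documents its own settings -->
    <str name="echoParams">all</str>
    <!-- Step 1: a single text field. Later steps: add name_exact, then stId, then dbId -->
    <str name="qf">name</str>
  </lst>
</requestHandler>
```

The point of the scratch handler is that each change is one edit to qf, so the field that makes the result count collapse is easy to identify.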
>>>> >>>> >>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood < >>>> >>>> wun...@wunderwood.org >>>> >>>> <mailto:wun...@wunderwood.org>> wrote: >>>> >>>> I normally use a weight of 8 for the most important field, like >>>> >>>> title. >>>> >>>> Other fields might get a 4 or 2. >>>> >>>> >>>> I add a “pf” field with the weights doubled, so that phrase matches >>>> >>>> have a higher weight. >>>> >>>> >>>> The weight of 8 comes from experience at Infoseek and Inktomi, two >>>> >>>> early web search engines. With different relevance algorithms and >>>> >>>> totally >>>> >>>> different evaluation and tuning systems, they settled on weights of 8 >>>> >>>> and >>>> >>>> 7.5 for HTML titles. With the the two radically different system >>>> >>>> getting >>>> >>>> the same number, I decided that was a property of the documents, not >>>> >>>> of >>>> >>>> the >>>> >>>> search engines. >>>> >>>> >>>> wunder >>>> Walter Underwood >>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org> >>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/> >>>> >>>> (my blog) >>>> >>>> >>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk >>>> >>>> <mailto:gvit...@ebi.ac.uk>> wrote: >>>> >>>> >>>> Hi Wunder, >>>> >>>> My indexer takes quite a few hours to be executed I am shortening >>>> >>>> it >>>> >>>> to run faster, but I also need to make sure it gives what we are >>>> >>>> expecting. >>>> >>>> This implementation's been there for >4y, and massively used. >>>> >>>> >>>> In your edismax handlers, weights of 20, 50, and 100 are >>>> >>>> extremely >>>> >>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen >>>> >>>> years >>>> >>>> of configuring Solr. >>>> >>>> I've inherited that implementation and I am really keen to >>>> >>>> adequate >>>> >>>> it, what would you recommend ? 
>>>> >>>> >>>> Cheers >>>> Guilherme >>>> >>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org >>>> >>>> <mailto:wun...@wunderwood.org>> wrote: >>>> >>>> >>>> Thanks for posting the files. Looking at schema.xml, I see that >>>> >>>> you >>>> >>>> still are using StopFilterFactory. The first advice we gave you was to >>>> remove that. >>>> >>>> >>>> Remove StopFilterFactory everywhere and reindex. >>>> >>>> You will continue to have problems matching stopwords until you >>>> >>>> do >>>> >>>> that. >>>> >>>> >>>> In your edismax handlers, weights of 20, 50, and 100 are >>>> >>>> extremely >>>> >>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen >>>> >>>> years >>>> >>>> of configuring Solr. >>>> >>>> >>>> wunder >>>> Walter Underwood >>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org> >>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/ >>>> >>>> >>>> (my blog) >>>> >>>> >>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk >>>> >>>> <mailto:gvit...@ebi.ac.uk>> wrote: >>>> >>>> >>>> Hi Paras, everyone >>>> >>>> Thank you again for your inputs and suggestions. I sorry to hear >>>> >>>> you had trouble with the attachments I will host it somewhere and >>>> >>>> share >>>> >>>> the >>>> >>>> links. >>>> >>>> I don't tweak my index, I get the data from the graph database, >>>> >>>> create a document as they are and save to solr. >>>> >>>> >>>> So, I am sending the new analysis screen querying the way you >>>> >>>> suggested. Also the results with params and solr query url. >>>> >>>> >>>> During the process of querying what you asked I found something >>>> >>>> really weird (at least for me). By accident, I ended up querying the >>>> >>>> using >>>> >>>> the default handler (/select) and it worked. Then If I use the one I >>>> >>>> must >>>> >>>> use, then sadly doesn't work. I am posting both results and I will >>>> >>>> also >>>> >>>> post the handlers as well. 
>>>> >>>> >>>> Here is the link with all the files mentioned before >>>> >>>> >>>> >>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 >>>> < >>>> >>>> >>>> >>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 >>>>> >>>> >>>> < >>>> >>>> >>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 >>>> >>>> < >>>> >>>> >>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 >>>> >>>> >>>> If the link doesn't work www dot dropbox dot com slash sh slash >>>> >>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0 >>>> >>>> >>>> Thanks >>>> >>>> On 7 Nov 2019, at 05:23, Paras Lehana < >>>> >>>> paras.leh...@indiamart.com >>>> >>>> <mailto:paras.leh...@indiamart.com>> wrote: >>>> >>>> >>>> Hi Guilherme. >>>> >>>> I am sending they analysis result and the json result as >>>> >>>> requested. >>>> >>>> >>>> >>>> Thanks for the effort. Luckily, I can see your attachments (low >>>> >>>> quality >>>> >>>> though). >>>> >>>> From the analysis screen, the analysis is working as expected. >>>> >>>> One >>>> >>>> of the >>>> >>>> reasons for query="lymphoid and *a* non-lymphoid cell" not >>>> >>>> matching >>>> >>>> document containing "Lymphoid and a non-Lymphoid cell" I can >>>> >>>> initially >>>> >>>> think of is: the stopword "a" is probably present in >>>> >>>> post-analysis >>>> >>>> either >>>> >>>> of query or index. Did you tweak your index time analysis after >>>> >>>> indexing? >>>> >>>> >>>> Do two things: >>>> >>>> 1. Post the analysis screen for and index=*"Immunoregulatory >>>> interactions between a Lymphoid and a non-Lymphoid cell"* and >>>> "query=*"lymphoid >>>> and a non-lymphoid cell"*. Try hosting the image and providing >>>> >>>> the >>>> >>>> link >>>> >>>> here. >>>> 2. Give the same JSON output as you have sent but this time >>>> >>>> with >>>> >>>> *"echoParams=all"*. Also, post the exact Solr query url. 
>>>> >>>> >>>> >>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson < >>>> >>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote: >>>> >>>> >>>> I don’t see the attachments, maybe I deleted old e-mails or >>>> >>>> some >>>> >>>> such. The >>>> >>>> Apache server is fairly aggressive about stripping attachments >>>> >>>> though, so >>>> >>>> it’s also possible they didn’t make it through. >>>> >>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri < >>>> >>>> gvit...@ebi.ac.uk >>>> >>>> <mailto:gvit...@ebi.ac.uk>> wrote: >>>> >>>> >>>> Thanks Erick. >>>> >>>> First, your index and analysis chains are considerably >>>> >>>> different, this >>>> >>>> can easily be a source of problems. In particular, using two >>>> >>>> different >>>> >>>> tokenizers is a huge red flag. I _strongly_ recommend against >>>> >>>> this unless >>>> >>>> you’re totally sure you understand the consequences. >>>> >>>> Additionally, your use >>>> >>>> of the length filter is suspicious, especially since your >>>> >>>> problem >>>> >>>> statement >>>> >>>> is about the addition of a single letter term and the min >>>> >>>> length >>>> >>>> allowed on >>>> >>>> that filter is 2. That said, it’s reasonable to suppose that >>>> >>>> the >>>> >>>> ’a’ is >>>> >>>> filtered out in both cases, but maybe you’ve found something >>>> >>>> odd >>>> >>>> about the >>>> >>>> interactions. >>>> >>>> I will investigate the min length and post the results later. >>>> >>>> Second, I have no idea what this will do. Are the equal >>>> >>>> signs >>>> >>>> typos? >>>> >>>> Used by custom code? >>>> >>>> This the url in my application, not solr params. That's the >>>> >>>> query string. >>>> >>>> >>>> What does “species=“ do? That’s not Solr syntax, so it’s >>>> >>>> likely >>>> >>>> that >>>> >>>> all the params with an equal-sign are totally ignored unless >>>> >>>> it’s >>>> >>>> just a >>>> >>>> typo. >>>> >>>> This is part of the application. 
Species will be used later >>>> >>>> on >>>> >>>> in solr >>>> >>>> to filter out the result. That's not solr. That my app params. >>>> >>>> >>>> Third, the easiest way to see what’s happening under the >>>> >>>> covers >>>> >>>> is to >>>> >>>> add “&debug=true” to the query and look at the parsed query. >>>> >>>> Ignore all the >>>> >>>> relevance calculations for the nonce, or specify >>>> >>>> “&debug=query” >>>> >>>> to skip >>>> >>>> that part. >>>> >>>> The two json files i've sent, they are debugQuery=on and the >>>> >>>> explain tag >>>> >>>> is present. >>>> >>>> I will try the searching the way you mentioned. >>>> >>>> Thank for your inputs >>>> >>>> Guilherme >>>> >>>> On 6 Nov 2019, at 14:14, Erick Erickson < >>>> >>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> >>>> >>>> wrote: >>>> >>>> >>>> Fwd to another server >>>> >>>> First, your index and analysis chains are considerably >>>> >>>> different, this >>>> >>>> can easily be a source of problems. In particular, using two >>>> >>>> different >>>> >>>> tokenizers is a huge red flag. I _strongly_ recommend against >>>> >>>> this unless >>>> >>>> you’re totally sure you understand the consequences. >>>> >>>> Additionally, your use >>>> >>>> of the length filter is suspicious, especially since your >>>> >>>> problem >>>> >>>> statement >>>> >>>> is about the addition of a single letter term and the min >>>> >>>> length >>>> >>>> allowed on >>>> >>>> that filter is 2. That said, it’s reasonable to suppose that >>>> >>>> the >>>> >>>> ’a’ is >>>> >>>> filtered out in both cases, but maybe you’ve found something >>>> >>>> odd >>>> >>>> about the >>>> >>>> interactions. >>>> >>>> >>>> Second, I have no idea what this will do. Are the equal >>>> >>>> signs >>>> >>>> typos? >>>> >>>> Used by custom code? 
>>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true >>>> >>>> < >>>> >>>> >>>> >>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true >>>> >>>> >>>> >>>> What does “species=“ do? That’s not Solr syntax, so it’s >>>> >>>> likely >>>> >>>> that >>>> >>>> all the params with an equal-sign are totally ignored unless >>>> >>>> it’s >>>> >>>> just a >>>> >>>> typo. >>>> >>>> >>>> Third, the easiest way to see what’s happening under the >>>> >>>> covers >>>> >>>> is to >>>> >>>> add “&debug=true” to the query and look at the parsed query. >>>> >>>> Ignore all the >>>> >>>> relevance calculations for the nonce, or specify >>>> >>>> “&debug=query” >>>> >>>> to skip >>>> >>>> that part. >>>> >>>> >>>> 90% + of the time, the question “why didn’t this query do >>>> >>>> what I >>>> >>>> expect” is answered by looking at the “&debug=query” output >>>> >>>> and >>>> >>>> the >>>> >>>> analysis page in the admin UI. NOTE: for the analysis page be >>>> >>>> sure to look >>>> >>>> at _both_ the query and index output. Also, and very important >>>> >>>> about the >>>> >>>> analysis page (and this is confusing) is that this _assumes_ >>>> >>>> that >>>> >>>> what you >>>> >>>> put in the text boxes have made it through the query parser >>>> >>>> intact and is >>>> >>>> analyzed by the field selected. Consider the search >>>> >>>> "q=field:word1 word2". >>>> >>>> Now you type “word1 word2” into the analysis text box and it >>>> >>>> looks like >>>> >>>> what you expect. That’s misleading because the query is >>>> >>>> _parsed_ >>>> >>>> as >>>> >>>> "field:word1 default_search_field:word2”. This is where >>>> >>>> “&debug=query” >>>> >>>> helps. 
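The field:word1 word2 case above can be made concrete. A sketch, assuming the standard (lucene) parser and a default field called text (a made-up name; the real default field is not shown in this thread), with the parsed query returned on every request:

```xml
<!-- solrconfig.xml (sketch): a defaults fragment for a handler using the standard parser -->
<lst name="defaults">
  <!-- Unfielded terms are searched in this field; "text" is an assumed name -->
  <str name="df">text</str>
  <!-- Return only the query section of the debug output, skipping score explanations -->
  <str name="debug">query</str>
</lst>
```

With these defaults, a request such as q=name:word1 word2 would be expected to show a parsed query along the lines of name:word1 text:word2, which is the mismatch the analysis page alone does not reveal.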
>>>> >>>> >>>> Best, >>>> Erick >>>> >>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana < >>>> >>>> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>> >>>> >>>> wrote: >>>> >>>> >>>> Hi Walter, >>>> >>>> The solr.StopFilter removes all tokens that are stopwords. >>>> >>>> Those words >>>> >>>> will >>>> >>>> not be in the index, so they can never match a query. >>>> >>>> >>>> >>>> I think the OP's concern is different results when adding a >>>> >>>> stopword. I >>>> >>>> think he's using the filter factory correctly - the query >>>> >>>> chain >>>> >>>> includes >>>> >>>> the filter as well so it should remove "a" while querying. >>>> >>>> *@Guilherme*, please post results for both the query, the >>>> >>>> document in >>>> >>>> result you are concerned about and post full result of >>>> >>>> analysis screen >>>> >>>> (for >>>> >>>> both query and index). >>>> >>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood < >>>> >>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>> >>>> >>>> wrote: >>>> >>>> >>>> No. >>>> >>>> The solr.StopFilter removes all tokens that are stopwords. >>>> >>>> Those words >>>> >>>> will not be in the index, so they can never match a query. >>>> >>>> 1. Remove the lines with solr.StopFilter from every >>>> >>>> analysis >>>> >>>> chain in >>>> >>>> schema.xml. >>>> 2. Reload the collection, restart Solr, or whatever to >>>> >>>> read >>>> >>>> the new >>>> >>>> config. >>>> >>>> 3. Reindex all of the documents. >>>> >>>> When indexed with the new analysis chain, the stopwords >>>> >>>> will >>>> >>>> not be >>>> >>>> removed and they will be searchable. >>>> >>>> wunder >>>> Walter Underwood >>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org> >>>> http://observer.wunderwood.org/ < >>>> >>>> http://observer.wunderwood.org/> (my blog) >>>> >>>> >>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri < >>>> >>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>> >>>> >>>> wrote: >>>> >>>> >>>> Ok. I am kind a lost now. 
>>>> If I open up the console > analysis and perform it, >>>> >>>> that's >>>> >>>> the final >>>> >>>> result. >>>> >>>> <Screenshot 2019-11-05 at 14.54.16.png> >>>> >>>> Your suggestion is: get rid of the <filter stopword.txt> >>>> >>>> in >>>> >>>> the >>>> >>>> schema.xml and during index phase replaceAll("in >>>> >>>> stopwords.txt"," ") >>>> >>>> then >>>> >>>> add to solr. Is that correct ? >>>> >>>> >>>> Thanks David >>>> >>>> On 5 Nov 2019, at 14:48, David Hastings < >>>> >>>> hastings.recurs...@gmail.com <mailto: >>>> >>>> hastings.recurs...@gmail.com >>>> >>>> >>>> <mailto:hastings.recurs...@gmail.com <mailto: >>>> >>>> hastings.recurs...@gmail.com>>> wrote: >>>> >>>> >>>> Fwd to another server >>>> >>>> no, >>>> <filter class="solr.StopFilterFactory" >>>> >>>> ignoreCase="true" >>>> >>>> words="stopwords.txt"/> >>>> >>>> is still using stopwords and should be removed, in my >>>> >>>> opinion of >>>> >>>> course, >>>> >>>> based on your use case may be different, but i generally >>>> >>>> axe any >>>> >>>> reference >>>> >>>> to them at all >>>> >>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri < >>>> >>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk> >>>> >>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>> >>>> >>>> wrote: >>>> >>>> >>>> Thanks. >>>> Haven't I done this here ? 
>>>> <fieldType name="text_field" class="solr.TextField" >>>> positionIncrementGap="100" omitNorms="false" > >>>> <analyzer type="index"> >>>> <tokenizer class="solr.StandardTokenizerFactory"/> >>>> <filter class="solr.ClassicFilterFactory"/> >>>> <filter class="solr.LengthFilterFactory" min="2" >>>> >>>> max="20"/> >>>> >>>> <filter class="solr.LowerCaseFilterFactory"/> >>>> <filter class="solr.StopFilterFactory" >>>> >>>> ignoreCase="true" >>>> >>>> words="stopwords.txt"/> >>>> </analyzer> >>>> >>>> >>>> On 5 Nov 2019, at 14:15, David Hastings < >>>> >>>> hastings.recurs...@gmail.com <mailto: >>>> >>>> hastings.recurs...@gmail.com >>>> >>>> >>>> <mailto:hastings.recurs...@gmail.com <mailto: >>>> >>>> hastings.recurs...@gmail.com>>> >>>> >>>> wrote: >>>> >>>> >>>> Fwd to another server >>>> >>>> The first thing you should do is remove any reference >>>> >>>> to >>>> >>>> stop >>>> >>>> words >>>> >>>> and >>>> >>>> never use them, then re-index your data and try it >>>> >>>> again. >>>> >>>> >>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri < >>>> >>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk> >>>> >>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>> >>>> >>>> wrote: >>>> >>>> >>>> Hi, >>>> >>>> I am performing a search to match a name >>>> >>>> (text_field), >>>> >>>> however >>>> >>>> this >>>> >>>> term >>>> >>>> contains 'and' and 'a' and it doesn't return any >>>> >>>> records. If i >>>> >>>> remove >>>> >>>> 'a' >>>> >>>> then it works. 
>>>> e.g
>>>> Search Term: lymphoid and a non-lymphoid cell
>>>> doesn't work:
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>
>>>> Search term: lymphoid and non-lymphoid cell
>>>> works:
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>> interested in the first result
>>>>
>>>> schema.xml
>>>> <field name="name" type="text_field" indexed="true" stored="true"
>>>>        omitNorms="false" required="true" multiValued="false"/>
>>>>
>>>> <analyzer type="query">
>>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>> </analyzer>
>>>>
>>>> <fieldType name="text_field" class="solr.TextField"
>>>>            positionIncrementGap="100" omitNorms="false" >
>>>>   <analyzer type="index">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> stopwords.txt
>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>> a
>>>> b
>>>> c
>>>> ....
>>>> an
>>>> and
>>>> are
>>>>
>>>> Running SolR 6.6.2.
>>>>
>>>> Is there anything I could do to prevent this ?
>>>>
>>>> Thanks
>>>> Guilherme
>>>>
>>>> --
>>>> --
>>>> Regards,
>>>>
>>>> *Paras Lehana* [65871]
>>>> Development Engineer, Auto-Suggest,
>>>> IndiaMART Intermesh Ltd.
>>>>
>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>> Noida, UP, IN - 201303
>>>>
>>>> Mob.: +91-9560911996
>>>> Work: 01203916600 | Extn: *8173*
>>>>
>>>> --
>>>> IMPORTANT:
>>>> NEVER share your IndiaMART OTP/ Password with anyone.
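For anyone following the thread, the query-time chain quoted above for text_field can be approximated outside Solr. The sketch below is a rough Python emulation, not Solr code: the function name query_analyze and the STOPWORDS set are hypothetical, and the set contains only the entries visible in the quoted stopwords.txt excerpt (the elided "...." part is left out). It suggests that text_field reduces both search terms to the same tokens, which would be consistent with the failure coming from fields that keep "a" (such as the StrField id type) rather than from text_field itself.

```python
import re

# Assumption: only the stop words visible in the quoted excerpt are listed here.
STOPWORDS = {"a", "b", "c", "an", "and", "are"}


def query_analyze(text, stopwords=STOPWORDS):
    """Roughly emulate the quoted query analyzer of text_field."""
    # PatternTokenizerFactory: split on anything outside [a-zA-Z0-9/._:],
    # discarding empty tokens.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    result = []
    for tok in tokens:
        # PatternReplaceFilterFactory x3: trim leading/trailing "/._:",
        # then turn underscores into spaces.
        tok = re.sub(r"^[/._:]+", "", tok)
        tok = re.sub(r"[/._:]+$", "", tok)
        tok = tok.replace("_", " ")
        # LengthFilterFactory min="2" max="20"
        if not 2 <= len(tok) <= 20:
            continue
        # LowerCaseFilterFactory
        tok = tok.lower()
        # StopFilterFactory
        if tok in stopwords:
            continue
        result.append(tok)
    return result


failing = query_analyze("lymphoid and a non-lymphoid cell")
working = query_analyze("lymphoid and non-lymphoid cell")
print(failing)
print(working)
print(failing == working)
```

In this emulation "a" is already dropped by the length filter (min="2") before the stop filter runs, so both queries yield the same token list for text_field. The Analysis screen in the Solr admin UI remains the authoritative way to check what each field type actually does with a query.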