I normally use a weight of 8 for the most important field, like title. Other fields might get a 4 or 2.
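A minimal sketch of such a handler, assuming a hypothetical handler name and hypothetical field names (title, summary, body) that are not taken from your schema. The qf parameter holds the per-field weights, and pf repeats the same fields with doubled weights for phrase matches:

```xml
<!-- Sketch only: the handler name and field names are hypothetical. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Per-field term weights: 8 for the most important field, then 4 and 2. -->
    <str name="qf">title^8 summary^4 body^2</str>
    <!-- Phrase-match weights: the same fields with the weights doubled. -->
    <str name="pf">title^16 summary^8 body^4</str>
  </lst>
</requestHandler>
```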
I add a “pf” parameter listing the same fields with the weights doubled, so that phrase matches have a higher weight.

The weight of 8 comes from experience at Infoseek and Inktomi, two early web search engines. With different relevance algorithms and totally different evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML titles. With two radically different systems arriving at nearly the same number, I decided that was a property of the documents, not of the search engines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
> Hi Wunder,
>
> My indexer takes quite a few hours to run. I am shortening it to run faster, but I also need to make sure it gives what we are expecting. This implementation has been there for >4y, and is massively used.
>
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
> I've inherited that implementation and I am really keen to adapt it. What would you recommend?
>
> Cheers
> Guilherme
>
>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>> Thanks for posting the files. Looking at schema.xml, I see that you still are using StopFilterFactory. The first advice we gave you was to remove that.
>>
>> Remove StopFilterFactory everywhere and reindex.
>>
>> You will continue to have problems matching stopwords until you do that.
>>
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>
>>> Hi Paras, everyone
>>>
>>> Thank you again for your inputs and suggestions.
>>> I am sorry to hear you had trouble with the attachments. I will host them somewhere and share the links.
>>> I don't tweak my index, I get the data from the graph database, create a document as they are and save to solr.
>>>
>>> So, I am sending the new analysis screen querying the way you suggested. Also the results with params and solr query url.
>>>
>>> During the process of querying what you asked I found something really weird (at least for me). By accident, I ended up querying using the default handler (/select) and it worked. Then if I use the one I must use, it sadly doesn't work. I am posting both results and I will also post the handlers as well.
>>>
>>> Here is the link with all the files mentioned before
>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>> If the link doesn't work www dot dropbox dot com slash sh slash fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>
>>> Thanks
>>>
>>>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>
>>>> Hi Guilherme.
>>>>
>>>> I am sending the analysis result and the json result as requested.
>>>>
>>>>
>>>> Thanks for the effort. Luckily, I can see your attachments (low quality though).
>>>>
>>>> From the analysis screen, the analysis is working as expected. One of the reasons for query="lymphoid and *a* non-lymphoid cell" not matching document containing "Lymphoid and a non-Lymphoid cell" I can initially think of is: the stopword "a" is probably present in post-analysis either of query or index. Did you tweak your index time analysis after indexing?
>>>>
>>>> Do two things:
>>>>
>>>>   1. Post the analysis screen for index=*"Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"* and query=*"lymphoid and a non-lymphoid cell"*.
>>>>      Try hosting the image and providing the link here.
>>>>   2. Give the same JSON output as you have sent but this time with *"echoParams=all"*. Also, post the exact Solr query url.
>>>>
>>>>
>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The Apache server is fairly aggressive about stripping attachments though, so it’s also possible they didn’t make it through.
>>>>>
>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>
>>>>>> Thanks Erick.
>>>>>>
>>>>>>> First, your index and analysis chains are considerably different, this can easily be a source of problems. In particular, using two different tokenizers is a huge red flag. I _strongly_ recommend against this unless you’re totally sure you understand the consequences. Additionally, your use of the length filter is suspicious, especially since your problem statement is about the addition of a single letter term and the min length allowed on that filter is 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both cases, but maybe you’ve found something odd about the interactions.
>>>>>> I will investigate the min length and post the results later.
>>>>>>
>>>>>>> Second, I have no idea what this will do. Are the equal signs typos? Used by custom code?
>>>>>> This is the URL in my application, not solr params. That's the query string.
>>>>>>
>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the params with an equal-sign are totally ignored unless it’s just a typo.
>>>>>> This is part of the application. Species will be used later on in solr to filter out the result. That's not solr. Those are my app params.
>>>>>>
>>>>>>> Third, the easiest way to see what’s happening under the covers is to add “&debug=true” to the query and look at the parsed query. Ignore all the relevance calculations for the nonce, or specify “&debug=query” to skip that part.
>>>>>> The two json files I've sent, they are debugQuery=on and the explain tag is present.
>>>>>> I will try searching the way you mentioned.
>>>>>>
>>>>>> Thanks for your inputs
>>>>>>
>>>>>> Guilherme
>>>>>>
>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>> Fwd to another server
>>>>>>>
>>>>>>> First, your index and analysis chains are considerably different, this can easily be a source of problems. In particular, using two different tokenizers is a huge red flag. I _strongly_ recommend against this unless you’re totally sure you understand the consequences. Additionally, your use of the length filter is suspicious, especially since your problem statement is about the addition of a single letter term and the min length allowed on that filter is 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both cases, but maybe you’ve found something odd about the interactions.
>>>>>>>
>>>>>>> Second, I have no idea what this will do. Are the equal signs typos? Used by custom code?
>>>>>>>
>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>
>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the params with an equal-sign are totally ignored unless it’s just a typo.
>>>>>>>
>>>>>>> Third, the easiest way to see what’s happening under the covers is to add “&debug=true” to the query and look at the parsed query.
>>>>>>> Ignore all the relevance calculations for the nonce, or specify “&debug=query” to skip that part.
>>>>>>>
>>>>>>> 90% + of the time, the question “why didn’t this query do what I expect” is answered by looking at the “&debug=query” output and the analysis page in the admin UI. NOTE: for the analysis page be sure to look at _both_ the query and index output. Also, and very important about the analysis page (and this is confusing) is that this _assumes_ that what you put in the text boxes has made it through the query parser intact and is analyzed by the field selected. Consider the search "q=field:word1 word2". Now you type “word1 word2” into the analysis text box and it looks like what you expect. That’s misleading because the query is _parsed_ as "field:word1 default_search_field:word2”. This is where “&debug=query” helps.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>>>>
>>>>>>>> Hi Walter,
>>>>>>>>
>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words will not be in the index, so they can never match a query.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think the OP's concern is different results when adding a stopword. I think he's using the filter factory correctly - the query chain includes the filter as well so it should remove "a" while querying.
>>>>>>>>
>>>>>>>> *@Guilherme*, please post results for both the query, the document in result you are concerned about and post full result of analysis screen (for both query and index).
>>>>>>>>
>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>>>>
>>>>>>>>> No.
>>>>>>>>>
>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words will not be in the index, so they can never match a query.
>>>>>>>>>
>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in schema.xml.
>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new config.
>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>
>>>>>>>>> When indexed with the new analysis chain, the stopwords will not be removed and they will be searchable.
>>>>>>>>>
>>>>>>>>> wunder
>>>>>>>>> Walter Underwood
>>>>>>>>> wun...@wunderwood.org
>>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>>
>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>
>>>>>>>>>> Ok. I am kind of lost now.
>>>>>>>>>> If I open up the console > analysis and perform it, that's the final result.
>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>
>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the schema.xml and during index phase replaceAll("in stopwords.txt"," ") then add to solr. Is that correct ?
>>>>>>>>>>
>>>>>>>>>> Thanks David
>>>>>>>>>>
>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>
>>>>>>>>>>> no,
>>>>>>>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>
>>>>>>>>>>> is still using stopwords and should be removed, in my opinion of course, based on your use case may be different, but i generally axe any reference to them at all
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>
>>>>>>>>>>>>> The first thing you should do is remove any reference to stop words and never use them, then re-index your data and try it again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am performing a search to match a name (text_field), however this term contains 'and' and 'a' and it doesn't return any records. If I remove 'a' then it works.
>>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>>> Search term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory"
>>>>>>>>>>>>>>     min="2" max="20"/>
>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>> b
>>>>>>>>>>>>>> c
>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Running Solr 6.6.2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Guilherme
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> *Paras Lehana* [65871]
>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>
>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>
>>>>>>>> Mob.: +91-9560911996
>>>>>>>> Work: 01203916600 | Extn: *8173*
>>>>>>>>
>>>>>>>> --
>>>>>>>> IMPORTANT:
>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.