Hi Wunder,

My indexer takes quite a few hours to run. I am shortening it so it runs faster, but I also need to make sure it still gives the results we expect. This implementation has been in place for more than four years and is heavily used.
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.

I inherited that implementation and I am really keen to adjust it. What would you recommend?

Cheers
Guilherme

> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org> wrote:
>
> Thanks for posting the files. Looking at schema.xml, I see that you still are
> using StopFilterFactory. The first advice we gave you was to remove that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>> Hi Paras, everyone
>>
>> Thank you again for your input and suggestions. I am sorry to hear you had
>> trouble with the attachments. I will host them somewhere and share the links.
>> I don't tweak my index: I get the data from the graph database, create the
>> documents as they are, and save them to Solr.
>>
>> So, I am sending the new analysis screen, queried the way you suggested,
>> and also the results with params and the Solr query URL.
>>
>> While running the queries you asked for, I found something really weird
>> (at least to me). By accident, I ended up querying with the default
>> handler (/select) and it worked. If I use the handler I must use, it
>> sadly doesn't work. I am posting both results and I will also post the
>> handlers.
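The handlers themselves are only in the linked files and are not shown in this thread, so the following is a hypothetical sketch of what an edismax handler with more modest weights might look like. The handler name `/search`, the `summation` field, and the specific boost values are assumptions for illustration; only the `name` field appears in the schema quoted in this thread. It keeps boosts at or below 16, in line with Walter's remark, and sets echoParams=all as Paras requested.

```xml
<!-- Hypothetical handler: the name, the "summation" field, and the weights are illustrative -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Boosts kept small; Walter reports never using a weight above 16 -->
    <str name="qf">name^8 summation^2</str>
    <!-- Echo every parameter in the response, useful while debugging -->
    <str name="echoParams">all</str>
  </lst>
</requestHandler>
```

The relative order of the boosts matters more than their absolute size, so scaling 20/50/100 down proportionally is one possible starting point, to be checked against real queries.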
>>
>> Here is the link with all the files mentioned before:
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>> If the link doesn't work: www dot dropbox dot com slash sh slash
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>
>> Thanks
>>
>>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>
>>> Hi Guilherme.
>>>
>>>> I am sending the analysis result and the JSON result as requested.
>>>
>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>> though).
>>>
>>> From the analysis screen, the analysis is working as expected. One reason
>>> I can initially think of for query="lymphoid and *a* non-lymphoid cell" not
>>> matching a document containing "Lymphoid and a non-Lymphoid cell" is that
>>> the stopword "a" is probably present post-analysis in either the query or
>>> the index. Did you tweak your index-time analysis after indexing?
>>>
>>> Do two things:
>>>
>>> 1. Post the analysis screen for index="Immunoregulatory interactions
>>>    between a Lymphoid and a non-Lymphoid cell" and query="lymphoid and a
>>>    non-lymphoid cell". Try hosting the image and providing the link here.
>>> 2. Give the same JSON output as you have sent, but this time with
>>>    "echoParams=all". Also, post the exact Solr query URL.
>>>
>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>>> Apache server is fairly aggressive about stripping attachments though, so
>>>> it’s also possible they didn’t make it through.
>>>>
>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>
>>>>> Thanks Erick.
>>>>>
>>>>>> First, your index and analysis chains are considerably different, this
>>>>>> can easily be a source of problems. In particular, using two different
>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>> you’re totally sure you understand the consequences. Additionally, your
>>>>>> use of the length filter is suspicious, especially since your problem
>>>>>> statement is about the addition of a single letter term and the min
>>>>>> length allowed on that filter is 2. That said, it’s reasonable to suppose
>>>>>> that the ’a’ is filtered out in both cases, but maybe you’ve found
>>>>>> something odd about the interactions.
>>>>>
>>>>> I will investigate the min length and post the results later.
>>>>>
>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>> Used by custom code?
>>>>>
>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>>
>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>> typo.
>>>>>
>>>>> This is part of the application. Species will be used later on in Solr
>>>>> to filter the results. That's not Solr; those are my app's params.
>>>>>
>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all
>>>>>> the relevance calculations for the nonce, or specify “&debug=query” to
>>>>>> skip that part.
>>>>>
>>>>> The two JSON files I sent were produced with debugQuery=on, and the
>>>>> explain tag is present.
>>>>> I will try searching the way you mentioned.
>>>>>
>>>>> Thanks for your input
>>>>>
>>>>> Guilherme
>>>>>
>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>> Fwd to another server
>>>>>>
>>>>>> First, your index and analysis chains are considerably different, this
>>>>>> can easily be a source of problems. In particular, using two different
>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>> you’re totally sure you understand the consequences. Additionally, your
>>>>>> use of the length filter is suspicious, especially since your problem
>>>>>> statement is about the addition of a single letter term and the min
>>>>>> length allowed on that filter is 2. That said, it’s reasonable to suppose
>>>>>> that the ’a’ is filtered out in both cases, but maybe you’ve found
>>>>>> something odd about the interactions.
>>>>>>
>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>> Used by custom code?
>>>>>>
>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>
>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>> typo.
>>>>>>
>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all
>>>>>> the relevance calculations for the nonce, or specify “&debug=query” to
>>>>>> skip that part.
>>>>>>
>>>>>> 90%+ of the time, the question “why didn’t this query do what I
>>>>>> expect” is answered by looking at the “&debug=query” output and the
>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to
>>>>>> look at _both_ the query and index output. Also, and this is very
>>>>>> important (and confusing), the analysis page _assumes_ that what you
>>>>>> put in the text boxes has made it through the query parser intact and is
>>>>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>>>>> helps.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>>>>>>
>>>>>>> Hi Walter,
>>>>>>>
>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>
>>>>>>> I think the OP's concern is getting different results when adding a
>>>>>>> stopword. I think he's using the filter factory correctly: the query
>>>>>>> chain includes the filter as well, so it should remove "a" while querying.
>>>>>>>
>>>>>>> @Guilherme, please post the results for both queries, the document in
>>>>>>> the results that you are concerned about, and the full analysis screen
>>>>>>> output (for both query and index).
>>>>>>>
>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>>>>
>>>>>>>> No.
>>>>>>>>
>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>
>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>>    schema.xml.
>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>>>>>>    config.
>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>
>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>> removed and they will be searchable.
>>>>>>>>
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> wun...@wunderwood.org
>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>
>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>
>>>>>>>>> OK. I am kind of lost now.
>>>>>>>>> If I open the console > Analysis and run it, that's the final result.
>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>
>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>>> schema.xml and, during the index phase, replaceAll("in stopwords.txt"," ")
>>>>>>>>> and then add to Solr. Is that correct?
>>>>>>>>>
>>>>>>>>> Thanks David
>>>>>>>>>
>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Fwd to another server
>>>>>>>>>>
>>>>>>>>>> No,
>>>>>>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>
>>>>>>>>>> is still using stopwords and should be removed. That's my opinion, of
>>>>>>>>>> course, and your use case may be different, but I generally axe any
>>>>>>>>>> reference to them at all.
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Haven't I done this here?
>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>            positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>             words="stopwords.txt"/>
>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>
>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>
>>>>>>>>>>>> The first thing you should do is remove any reference to stop words
>>>>>>>>>>>> and never use them, then re-index your data and try it again.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am performing a search to match a name (text_field), however this
>>>>>>>>>>>>> term contains 'and' and 'a' and it doesn't return any records. If I
>>>>>>>>>>>>> remove 'a' then it works.
>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>> Search term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>
>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>> works:
>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>
>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>> <field name="name" type="text_field"
>>>>>>>>>>>>>        indexed="true" stored="true" omitNorms="false" required="true"
>>>>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>              pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>           pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>           pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>           pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>           words="stopwords.txt"/>
>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>
>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>            positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>             words="stopwords.txt"/>
>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>                pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>             pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>             pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>             pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>             words="stopwords.txt"/>
>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>
>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>> a
>>>>>>>>>>>>> b
>>>>>>>>>>>>> c
>>>>>>>>>>>>> ....
>>>>>>>>>>>>> an
>>>>>>>>>>>>> and
>>>>>>>>>>>>> are
>>>>>>>>>>>>>
>>>>>>>>>>>>> Running Solr 6.6.2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there anything I could do to prevent this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Guilherme
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Paras Lehana* [65871]
>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>
>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>> Noida, UP, IN - 201303
>>>>>>>
>>>>>>> Mob.: +91-9560911996
>>>>>>> Work: 01203916600 | Extn: *8173*
>>>>>>>
>>>>>>> --
>>>>>>> IMPORTANT:
>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>
>>> --
>>> Regards,
>>> Paras Lehana
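
For reference, the change recommended in this thread (remove StopFilterFactory from every analysis chain, then reindex) might look roughly like the sketch below. It is the `text_field` definition quoted above with only the stop filter lines removed; everything else is copied as posted. Two caveats raised by Erick are left untouched here: the index and query chains still use different tokenizers, and LengthFilterFactory with min="2" would still drop one-letter tokens such as "a". Lowering it to min="1" is an assumption that nobody in the thread confirmed, so it is only flagged in a comment.

```xml
<!-- Sketch only: text_field from the thread with StopFilterFactory removed -->
<fieldType name="text_field" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <!-- min="2" still drops one-letter tokens such as "a" (see Erick's note) -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- StopFilterFactory line removed here -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- StopFilterFactory line removed here -->
  </analyzer>
</fieldType>
```

After editing schema.xml, the collection has to be reloaded and every document reindexed, as in Walter's steps 1 to 3; documents indexed under the old chain will not contain the stopwords.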