Re: Difference in queryString and Parsed query
Thank you Walter Underwood for a complete, honest review. I will start simple by using the sample.

Regards,
Lavanya

On Tuesday, 22 January 2019, 12:31:55 pm AEDT, Walter Underwood wrote:

There are many, many problems with this analyzer chain definition.

This is a summary of the indexing chain:

* WhitespaceTokenizerFilter
* LowerCaseFilter
* SynonymFilter (with ignoreCase=true after lower-casing everything)
* StopFilter (we should have stopped using stopwords 20 years ago)
* WordDelimiterFilter (with all the transformation options set to 0, does nothing)
* RemoveDuplicates (this must always be last)
* KStemFilter (good choice)
* EdgeNGramFilter (!!! are you doing prefix matching? doing that with stemming makes bizarre matches)
* ReverseStringFilter (Yowza! Only do this on unmodified tokens, what does this mean on word stems? Even more bizarre)

Reversed stemmed edge ngrams should cause some really exciting matches.

Summary of the query chain:

* WhitespaceTokenizerFilter
* LowerCaseFilter
* PorterStemFilter (different stemmer from indexing, guarantees missed matches)
* SynonymFilter (after stemmer? never do this, all tokens need stemmed)
* StopFilter (bad, but extra bad after a Porter stemmer that doesn’t generate dictionary words)
* WordDelimiterFilter (again, doing nothing, also the results should have been stemmed)
* KStemFilter (two stemmers in a chain! never do that! plus the Porter stemmer doesn’t produce dictionary words, so KStem won’t do much)

Short version, I’m astonished that this configuration works at all. Delete the whole thing, use one from the sample file (without stop words), and reindex. There is no way to fix this.

Not to be mean, but this is the worst field type definition I have ever seen.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
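A minimal Python sketch of the edge n-gram point in the review above; it is not Solr code. It assumes, without confirmation from the thread, that the index-side stemmer hands "battery" to the n-gram filter unchanged. `edge_ngrams` is a hypothetical helper that mimics an edge n-gram filter with the minGramSize="1" and maxGramSize="255" values from the posted field type. Under that assumption the indexed prefixes contain "batter" but not the query-side term "batteri", which would be consistent with the report that dropping the trailing y returns results.

```python
def edge_ngrams(token, min_gram=1, max_gram=255):
    """Return the leading prefixes of token, shortest first."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Assumption: the index chain passes "battery" to the n-gram step unchanged.
indexed_terms = set(edge_ngrams("battery"))

# "batteri" is the parsed query shown in the thread;
# "batter" is the same search with the trailing y omitted.
for query_term in ("batteri", "batter"):
    verdict = "matches" if query_term in indexed_terms else "no match"
    print(query_term, verdict)
```

If the index and query chains produced the same token for "battery", the lookup would succeed without relying on prefixes at all.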
Re: Difference in queryString and Parsed query
Thank you Aman Deep.

I tried removing the KStemFilterFactory and still get the same issue, but when I comment out the PorterStemFilterFactory the character y does not get replaced.

On Monday, 21 January 2019, 11:16:23 pm AEDT, Aman deep singh wrote:

Hi Lavanya,
This is probably due to the KStemFilterFactory: it is removing the y character, since the stemmer has a rule for words ending with y.

Regards,
Aman Deep Singh
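The y to i change reported in this thread matches step 1c of the original Porter algorithm, which rewrites a final y to i when the rest of the word contains a vowel. Below is a minimal Python sketch of only that one step. The function names are made up for illustration, the other Porter steps are left out, and this is not the code behind PorterStemFilterFactory.

```python
def _is_consonant(word, i):
    """Porter's definition: y counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not _is_consonant(word, i - 1)
    return True


def porter_step_1c(word):
    """Apply only Porter step 1c: (*v*) Y -> I."""
    if not word.endswith("y"):
        return word
    stem = word[:-1]
    has_vowel = any(not _is_consonant(stem, i) for i in range(len(stem)))
    return stem + "i" if has_vowel else word


for keyword in ("battery", "way", "accessory", "sky"):
    print(keyword, "->", porter_step_1c(keyword))
```

For "battery" this single step already yields the "batteri" seen in the parsed query, while a word such as "sky", whose stem has no vowel, is left alone.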
Re: Difference in queryString and Parsed query
Thank you Mikhail.

On Monday, 21 January 2019, 11:13:51 pm AEDT, Mikhail Khludnev wrote:

querystring is what goes into QParser, parsedquery is LuceneQuery.toString()

On Mon, Jan 21, 2019 at 3:04 PM Lavanya Thirumalaisami wrote:

> Hi,
> Our solr search is not returning expected results for keywords ending with
> the character 'y'.
> For example keywords like battery, way, accessory etc.
> I tried debugging the solr query in solr admin console and i find there is
> a difference between query string and parsed query.
> "querystring":"battery","parsedquery":"batteri",
> Also I find that if i search omitting the character y i am getting all the
> results.
> This happens only for keywords ending with Y and most others we donot have
> this issue.
> Could any one please help me understand why is the keywords gets changed,
> specially the last character. Is there any issues in my field type
> definition.
> While indexing the data we use the text data type and we have defined as
> follows
>
> positionIncrementGap="100">
> class="solr.WhitespaceTokenizerFactory" />
> class="solr.LowerCaseFilterFactory" />
> class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> words="stopwords.txt" />
> catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="0" generateWordParts="0" preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" />
> class="solr.RemoveDuplicatesTokenFilterFactory" />
> class="solr.KStemFilterFactory" />
> class="solr.EdgeNGramFilterFactory" maxGramSize="255" minGramSize="1" />
> type="query">
> class="solr.LowerCaseFilterFactory" />
> class="solr.PorterStemFilterFactory" />
> class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> words="stopwords.txt" />
> catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="0" generateWordParts="0" preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" />
> class="solr.KStemFilterFactory" />
>
> Regards,
> Lavanya

--
Sincerely yours
Mikhail Khludnev
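To compare the two values for other keywords without reading the admin console by hand, something like the following Python sketch could be used. The host, port and collection name ("products") are placeholders, not taken from the thread, and the sample response is trimmed to just the two fields quoted in the original question. `debugQuery=true` is the standard Solr request parameter for asking for debug output.

```python
import json
from urllib.parse import urlencode


def debug_url(base, collection, query):
    """Build a select URL that asks Solr to include debug output."""
    params = urlencode({"q": query, "debugQuery": "true", "wt": "json"})
    return f"{base}/{collection}/select?{params}"


def query_vs_parsed(response_text):
    """Return (querystring, parsedquery) from a JSON debug response."""
    debug = json.loads(response_text)["debug"]
    return debug["querystring"], debug["parsedquery"]


# Hypothetical host and collection; adjust to the real deployment.
print(debug_url("http://localhost:8983/solr", "products", "battery"))

# Sample trimmed to the two fields quoted in the thread.
sample = '{"debug": {"querystring": "battery", "parsedquery": "batteri"}}'
print(query_vs_parsed(sample))
```

The first value is the text handed to the query parser; the second is the Lucene query after query-time analysis, so any stemming shows up there.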
Difference in queryString and Parsed query
Hi,

Our Solr search is not returning the expected results for keywords ending with the character 'y', for example battery, way, accessory etc.

I tried debugging the query in the Solr admin console and I find there is a difference between the query string and the parsed query:

"querystring":"battery","parsedquery":"batteri",

I also find that if I search omitting the character y, I get all the results. This happens only for keywords ending with y; for most others we do not have this issue.

Could anyone please help me understand why the keywords get changed, specifically the last character? Are there any issues in my field type definition?

While indexing the data we use the text data type, which we have defined as follows:

Regards,
Lavanya
Re: Debugging Solr Search results & Issues with Distributed IDF
Thank you for the inputs Doug and Charlie.

On Wednesday, 2 January 2019, 11:39:13 pm AEDT, Doug Turnbull wrote:

On (2) these are BM25 parameters. There are several articles that discuss BM25 in depth:

https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables

--
Doug Turnbull | CTO | OpenSource Connections <http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
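As a rough illustration of what k1 and b do, here is a Python sketch of the BM25 idf and tfNorm formulas as I understand Lucene's BM25Similarity to compute them in this version. The function names and the `field_len` / `avg_field_len` parameters are my own; the numbers plugged in come from the "All:2x" clause of Doc1 in the explain output. With b = 0 (norms omitted for the field) the length term drops out, which is why tfNorm depends only on the frequency and k1.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1=1.2, b=0.0, field_len=1.0, avg_field_len=1.0):
    """Saturating term-frequency factor; b scales length normalisation."""
    length_factor = 1 - b + b * field_len / avg_field_len
    return freq * (k1 + 1) / (freq + k1 * length_factor)


# Values from the "All:2x" clause of Doc1 in the explain output.
idf = bm25_idf(doc_freq=1346, doc_count=42836)
tf_norm = bm25_tf_norm(freq=2.0)
print(idf, tf_norm, idf * tf_norm)

# Same term in Doc15, where the explain output reports different statistics.
print(bm25_idf(doc_freq=1339, doc_count=42874))
```

These reproduce the 3.4598935, 1.375 and 4.7573533 in the explain output to within float precision. The same term showing docCount=42836 for one document and docCount=42874 for the other may indicate that each document was scored with its own shard's local statistics, though the thread does not confirm this.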
Debugging Solr Search results & Issues with Distributed IDF
Hi,

I am trying to debug a query to find out why one document gets a higher score than the other. Below are two similar products, with the debug results I get from the Solr admin console.

"Doc1":

15.20965 = sum of:
  4.7573533 = max of:
    4.7573533 = weight(All:2x in 962) [], result of:
      4.7573533 = score(doc=962,freq=2.0 = termFreq=2.0), product of:
        3.4598935 = idf(docFreq=1346, docCount=42836)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
  10.452296 = max of:
    5.9166136 = weight(All:powerpoint in 962) [], result of:
      5.9166136 = score(doc=962,freq=2.0 = termFreq=2.0), product of:
        4.302992 = idf(docFreq=579, docCount=42836)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    10.452296 = weight(All:"socket outlet" in 962) [], result of:
      10.452296 = score(doc=962,freq=2.0 = phraseFreq=2.0), product of:
        7.60167 = idf(), sum of:
          3.5370626 = idf(docFreq=1246, docCount=42836)
          4.064607 = idf(docFreq=735, docCount=42836)
        1.375 = tfNorm, computed from:
          2.0 = phraseFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)

"Doc15":

13.258003 = sum of:
  5.7317085 = max of:
    5.7317085 = weight(All:doubl in 2122) [], result of:
      5.7317085 = score(doc=2122,freq=2.0 = termFreq=2.0), product of:
        4.168515 = idf(docFreq=663, docCount=42874)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    4.7657394 = weight(All:2x in 2122) [], result of:
      4.7657394 = score(doc=2122,freq=2.0 = termFreq=2.0), product of:
        3.4659925 = idf(docFreq=1339, docCount=42874)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    5.390302 = weight(All:2g in 2122) [], result of:
      5.390302 = score(doc=2122,freq=2.0 = termFreq=2.0), product of:
        3.9202197 = idf(docFreq=850, docCount=42874)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
  7.526294 = max of:
    5.8597584 = weight(All:powerpoint in 2122) [], result of:
      5.8597584 = score(doc=2122,freq=2.0 = termFreq=2.0), product of:
        4.2616425 = idf(docFreq=604, docCount=42874)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    7.526294 = weight(All:"socket outlet" in 2122) [], result of:
      7.526294 = score(doc=2122,freq=1.0 = phraseFreq=1.0), product of:
        7.526294 = idf(), sum of:
          3.4955401 = idf(docFreq=1300, docCount=42874)
          4.030754 = idf(docFreq=761, docCount=42874)
        1.0 = tfNorm, computed from:
          1.0 = phraseFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)

My questions:

1. IDF: I understand from the Solr documentation that IDF is calculated separately for each shard. I have added the following stats cache config to solrconfig.xml and reloaded the collection

   But even after that there is no change in the calculated IDF.

2. What are parameter b and parameter k1?

3. Why are there more parameters included for my Doc15 than for Doc1?

Is there any documentation I can refer to in order to understand the Solr score calculations in depth?

We are using Solr 6.1 in Cloud mode with 3 ZooKeepers, 3 masters and 3 replicas.

Regards,
Lavanya
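On question 3, the explain output itself suggests an answer: each "max of" group lists every clause that matched the document, and Doc15 simply matched more of them ("doubl" and "2g" in addition to "2x"), but only the largest clause in each group counts towards the total. The Python sketch below just replays that arithmetic with the clause scores copied from the output above; the data layout and the `total_score` name are my own, and it assumes the nesting is "sum of" over "max of" groups as printed.

```python
# Clause scores copied from the explain output, grouped by "max of" block.
doc1 = [
    {"2x": 4.7573533},
    {"powerpoint": 5.9166136, '"socket outlet"': 10.452296},
]
doc15 = [
    {"doubl": 5.7317085, "2x": 4.7657394, "2g": 5.390302},
    {"powerpoint": 5.8597584, '"socket outlet"': 7.526294},
]


def total_score(groups):
    """Sum the best-scoring clause of each group ("sum of" over "max of")."""
    return sum(max(group.values()) for group in groups)


for name, groups in (("Doc1", doc1), ("Doc15", doc15)):
    winners = [max(group, key=group.get) for group in groups]
    print(name, total_score(groups), winners)
```

Read this way, Doc1 scores higher mainly because the phrase "socket outlet" occurs twice in it (phraseFreq=2.0, tfNorm 1.375) and only once in Doc15 (phraseFreq=1.0, tfNorm 1.0); the extra matching clauses in Doc15 do not add anything beyond the best one in their group.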