Re: Difference in queryString and Parsed query

2019-01-22 Thread Lavanya Thirumalaisami
 
Thank you, Walter Underwood, for a complete and honest review.
I will start simple by using the sample.

Regards,
Lavanya

On Tuesday, 22 January 2019, 12:31:55 pm AEDT, Walter Underwood wrote:
 
 There are many, many problems with this analyzer chain definition.

This is a summary of the indexing chain:

* WhitespaceTokenizer
* LowerCaseFilter
* SynonymFilter (with ignoreCase=true after lower-casing everything)
* StopFilter (we should have stopped using stopwords 20 years ago)
* WordDelimiterFilter (with all the transformation options set to 0, does 
nothing)
* RemoveDuplicates (this must always be last)
* KStemFilter (good choice)
* EdgeNGramFilter (!!! are you doing prefix matching? doing that with stemming 
makes bizarre matches)
* ReverseStringFilter (Yowza! Only do this on unmodified tokens, what does this 
mean on word stems? Even more bizarre)

Reversed stemmed edge ngrams should cause some really exciting matches. 
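
A minimal Python sketch of what the edge n-gram step does to one indexed term. It is not Lucene code: `edge_ngrams` is a hypothetical helper, and it assumes that KStem leaves "battery" unchanged and that the grams are indexed unreversed. Under those assumptions the index contains the prefix "batter" but not the Porter stem "batteri" that the query side produces, which is consistent with the report that searching without the final y returns results.

```python
def edge_ngrams(token, min_gram=1, max_gram=255):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Terms the index side would hold for "battery" (assumed unstemmed by KStem).
indexed = set(edge_ngrams("battery"))

# Compare a few query-side terms against the indexed prefixes.
for query_term in ("battery", "batter", "batteri"):
    print(query_term, query_term in indexed)
```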

Summary of the query chain:

* WhitespaceTokenizer
* LowerCaseFilter
* PorterStemFilter (different stemmer from indexing, guarantees missed matches)
* SynonymFilter (after stemmer? never do this, all tokens need to be stemmed)
* StopFilter (bad, but extra bad after a Porter stemmer that doesn’t generate 
dictionary words)
* WordDelimiterFilter (again, doing nothing, also the results should have been 
stemmed)
* KStemFilter (two stemmers in a chain! never do that! plus the Porter stemmer 
doesn’t produce dictionary words, so KStem won’t do much)
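
To illustrate why the order of the stemmer and the synonym filter matters, here is a small Python sketch. Everything in it is hypothetical: `toy_stem` is a stand-in that only turns a final y into i, and the synonym entry is invented for the example rather than taken from the poster's synonyms.txt.

```python
# Hypothetical synonym table keyed by unstemmed words, as in a synonyms file.
SYNONYMS = {"battery": ["cell"]}


def toy_stem(token):
    """Stand-in stemmer: turn a final y into i (not a real Porter stemmer)."""
    if len(token) > 2 and token.endswith("y"):
        return token[:-1] + "i"
    return token


def expand(tokens):
    """Append the synonyms of every token that has an entry."""
    out = []
    for token in tokens:
        out.append(token)
        out.extend(SYNONYMS.get(token, []))
    return out


def stem_then_synonyms(tokens):
    # The synonym lookup sees "batteri", which has no entry.
    return expand([toy_stem(t) for t in tokens])


def synonyms_then_stem(tokens):
    # The synonym lookup sees "battery"; stemming then covers every token.
    return [toy_stem(t) for t in expand(tokens)]


print(stem_then_synonyms(["battery"]))
print(synonyms_then_stem(["battery"]))
```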

Short version, I’m astonished that this configuration works at all. Delete the 
whole thing, use one from the sample file (without stop words), and reindex. 
There is no way to fix this. Not to be mean, but this is the worst field type 
definition I have ever seen.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 21, 2019, at 4:24 AM, Lavanya Thirumalaisami 
>  wrote:
> 
> 
> Thank you Aman Deep 
> I tried removing the KStemFilterFactory and still get the same issue, but
> when I comment out the PorterStemFilterFactory the character y does not get
> replaced.
> 
>    On Monday, 21 January 2019, 11:16:23 pm AEDT, Aman deep singh 
> wrote:  
> 
> Hi Lavanya,
> This is probably due to the KStemFilterFactory: it is removing the y
> character, since the stemmer has a rule for words ending with y.
> 
> 
> Regards,
> Aman Deep Singh
> 
>> On 21-Jan-2019, at 5:43 PM, Mikhail Khludnev  wrote:
>> 
>> querystring is what goes into QParser, parsedquery is
>> LuceneQuery.toString()
>> 
>> On Mon, Jan 21, 2019 at 3:04 PM Lavanya Thirumalaisami
>>  wrote:
>> 
>>> Hi,
>>> Our Solr search is not returning the expected results for keywords ending
>>> with the character 'y', for example battery, way, accessory, etc.
>>> I tried debugging the query in the Solr admin console and found a
>>> difference between the query string and the parsed query:
>>> "querystring":"battery","parsedquery":"batteri",
>>> I also find that if I search omitting the character y, I get all the
>>> results.
>>> This happens only for keywords ending with y; most others do not have
>>> this issue.
>>> Could anyone please help me understand why the keywords get changed,
>>> especially the last character? Are there any issues in my field type
>>> definition?
>>> While indexing the data we use the text data type, which we have defined
>>> as follows:
>>> <fieldType … positionIncrementGap="100">
>>>   <analyzer …>
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter … words="stopwords.txt" />
>>>     <filter catenateWords="1" class="solr.WordDelimiterFilterFactory"
>>>             generateNumberParts="0" generateWordParts="0"
>>>             preserveOriginal="1" splitOnCaseChange="0"
>>>             splitOnNumerics="0" />
>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>>>     <filter class="solr.KStemFilterFactory" />
>>>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="255"
>>>             minGramSize="1" />
>>>     …
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     …
>>>     <filter class="solr.LowerCaseFilterFactory" />
>>>     <filter class="solr.PorterStemFilterFactory" />
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>             ignoreCase="true" expand="true"/>
>>>     <filter … words="stopwords.txt" />
>>>     <filter catenateWords="0" class="solr.WordDelimiterFilterFactory"
>>>             generateNumberParts="0" generateWordParts="0"
>>>             preserveOriginal="1" splitOnCaseChange="0"
>>>             splitOnNumerics="0" />
>>>     <filter class="solr.KStemFilterFactory" />
>>>   </analyzer>
>>> </fieldType>
>>>
>>> Regards,
>>> Lavanya
>> 
>> 
>> 
>> -- 
>> Sincerely yours
>> Mikhail Khludnev
  

Re: Difference in queryString and Parsed query

2019-01-21 Thread Lavanya Thirumalaisami
 
Thank you Aman Deep 
I tried removing the KStemFilterFactory and still get the same issue, but
when I comment out the PorterStemFilterFactory the character y does not get
replaced.
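
The y-to-i rewrite seen here matches step 1c of the original Porter algorithm, which replaces a final y with i when the rest of the word contains a vowel. The Python sketch below implements only that one step as an illustration; the function names are hypothetical, it is not Lucene's PorterStemFilter, and all other Porter steps are omitted.

```python
VOWELS = set("aeiou")


def is_vowel(word, i):
    """Porter's vowel test: a, e, i, o, u, or y preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)


def porter_step_1c(word):
    """Step 1c, (*v*) Y -> I: final y becomes i if the stem contains a vowel."""
    if word.endswith("y"):
        stem = word[:-1]
        if any(is_vowel(stem, i) for i in range(len(stem))):
            return stem + "i"
    return word


for term in ("battery", "accessory", "way", "sky"):
    print(term, "->", porter_step_1c(term))
```

For "battery" this gives "batteri", the same value the thread reports in parsedquery.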

On Monday, 21 January 2019, 11:16:23 pm AEDT, Aman deep singh 
 wrote:  
 
 Hi Lavanya,
This is probably due to the KStemFilterFactory: it is removing the y
character, since the stemmer has a rule for words ending with y.


Regards,
Aman Deep Singh

> On 21-Jan-2019, at 5:43 PM, Mikhail Khludnev  wrote:
> 
> querystring is what goes into QParser, parsedquery is
> LuceneQuery.toString()
> 
> On Mon, Jan 21, 2019 at 3:04 PM Lavanya Thirumalaisami
>  wrote:
> 
>> Hi,
>> Our Solr search is not returning the expected results for keywords ending
>> with the character 'y', for example battery, way, accessory, etc.
>> I tried debugging the query in the Solr admin console and found a
>> difference between the query string and the parsed query:
>> "querystring":"battery","parsedquery":"batteri",
>> I also find that if I search omitting the character y, I get all the
>> results.
>> This happens only for keywords ending with y; most others do not have
>> this issue.
>> Could anyone please help me understand why the keywords get changed,
>> especially the last character? Are there any issues in my field type
>> definition?
>> While indexing the data we use the text data type, which we have defined
>> as follows:
>> <fieldType … positionIncrementGap="100">
>>   <analyzer …>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter … words="stopwords.txt" />
>>     <filter catenateWords="1" class="solr.WordDelimiterFilterFactory"
>>             generateNumberParts="0" generateWordParts="0"
>>             preserveOriginal="1" splitOnCaseChange="0"
>>             splitOnNumerics="0" />
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>>     <filter class="solr.KStemFilterFactory" />
>>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="255"
>>             minGramSize="1" />
>>     …
>>   </analyzer>
>>   <analyzer type="query">
>>     …
>>     <filter class="solr.LowerCaseFilterFactory" />
>>     <filter class="solr.PorterStemFilterFactory" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter … words="stopwords.txt" />
>>     <filter catenateWords="0" class="solr.WordDelimiterFilterFactory"
>>             generateNumberParts="0" generateWordParts="0"
>>             preserveOriginal="1" splitOnCaseChange="0"
>>             splitOnNumerics="0" />
>>     <filter class="solr.KStemFilterFactory" />
>>   </analyzer>
>> </fieldType>
>>
>> Regards,
>> Lavanya
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
  

Re: Difference in queryString and Parsed query

2019-01-21 Thread Lavanya Thirumalaisami
Thank you, Mikhail.

On Monday, 21 January 2019, 11:13:51 pm AEDT, Mikhail Khludnev wrote:

querystring is what goes into QParser, parsedquery is
LuceneQuery.toString()
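
Both values appear in the debug section of a Solr response when the request carries `debugQuery=true`. A minimal Python sketch of building such a request URL follows; the host, port and the collection name "products" are assumptions for illustration, not details from this thread.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, collection, query):
    """Build a /select URL that asks Solr to include debug output."""
    params = {"q": query, "debugQuery": "true", "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name.
url = debug_query_url("http://localhost:8983/solr", "products", "battery")
print(url)
```

In the JSON response, `debug.querystring` holds the text as sent and `debug.parsedquery` holds the query after analysis, so any difference between them comes from the query analyzer chain.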

On Mon, Jan 21, 2019 at 3:04 PM Lavanya Thirumalaisami
 wrote:

> Hi,
> Our Solr search is not returning the expected results for keywords ending
> with the character 'y', for example battery, way, accessory, etc.
> I tried debugging the query in the Solr admin console and found a
> difference between the query string and the parsed query:
> "querystring":"battery","parsedquery":"batteri",
> I also find that if I search omitting the character y, I get all the
> results.
> This happens only for keywords ending with y; most others do not have
> this issue.
> Could anyone please help me understand why the keywords get changed,
> especially the last character? Are there any issues in my field type
> definition?
> While indexing the data we use the text data type, which we have defined
> as follows:
> <fieldType … positionIncrementGap="100">
>   <analyzer …>
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter … words="stopwords.txt" />
>     <filter catenateWords="1" class="solr.WordDelimiterFilterFactory"
>             generateNumberParts="0" generateWordParts="0"
>             preserveOriginal="1" splitOnCaseChange="0"
>             splitOnNumerics="0" />
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
>     <filter class="solr.KStemFilterFactory" />
>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="255"
>             minGramSize="1" />
>     …
>   </analyzer>
>   <analyzer type="query">
>     …
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.PorterStemFilterFactory" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter … words="stopwords.txt" />
>     <filter catenateWords="0" class="solr.WordDelimiterFilterFactory"
>             generateNumberParts="0" generateWordParts="0"
>             preserveOriginal="1" splitOnCaseChange="0"
>             splitOnNumerics="0" />
>     <filter class="solr.KStemFilterFactory" />
>   </analyzer>
> </fieldType>
>
> Regards,
> Lavanya



-- 
Sincerely yours
Mikhail Khludnev
  

Difference in queryString and Parsed query

2019-01-21 Thread Lavanya Thirumalaisami
Hi,
Our Solr search is not returning the expected results for keywords ending with
the character 'y', for example battery, way, accessory, etc.
I tried debugging the query in the Solr admin console and found a difference
between the query string and the parsed query:
"querystring":"battery","parsedquery":"batteri",
I also find that if I search omitting the character y, I get all the results.
This happens only for keywords ending with y; most others do not have this
issue.
Could anyone please help me understand why the keywords get changed,
especially the last character? Are there any issues in my field type
definition?
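While indexing the data we use the text data type, which we have defined as
follows:
<fieldType … positionIncrementGap="100">
  <analyzer …>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter … words="stopwords.txt" />
    <filter catenateWords="1" class="solr.WordDelimiterFilterFactory"
            generateNumberParts="0" generateWordParts="0"
            preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.KStemFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="255"
            minGramSize="1" />
    …
  </analyzer>
  <analyzer type="query">
    …
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PorterStemFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter … words="stopwords.txt" />
    <filter catenateWords="0" class="solr.WordDelimiterFilterFactory"
            generateNumberParts="0" generateWordParts="0"
            preserveOriginal="1" splitOnCaseChange="0" splitOnNumerics="0" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
</fieldType>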


Regards,
Lavanya

Re: Debugging Solr Search results & Issues with Distributed IDF

2019-01-06 Thread Lavanya Thirumalaisami
Thank you for the inputs, Doug and Charlie.

On Wednesday, 2 January 2019, 11:39:13 pm AEDT, Doug Turnbull wrote:

On (2): these are the BM25 parameters k1 and b. Several articles discuss
BM25 in depth:

https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
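
As a rough cross-check of the debug output quoted below, the idf and tfNorm values can be reproduced with a few lines of Python. This is a sketch that assumes the standard Lucene BM25 formulas; the function names and the `field_len`/`avg_field_len` arguments are illustrative, and with b = 0 (norms omitted) the length term drops out.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1=1.2, b=0.0, field_len=1.0, avg_field_len=1.0):
    """tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len / avgLen))."""
    length_factor = 1 - b + b * field_len / avg_field_len
    return freq * (k1 + 1) / (freq + k1 * length_factor)


# Values taken from the Doc1 explanation for the term All:2x.
idf = bm25_idf(doc_freq=1346, doc_count=42836)
tf_norm = bm25_tf_norm(freq=2.0)
print(idf, tf_norm, idf * tf_norm)
```

Here k1 controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly the score is normalized by field length.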





On Tue, Jan 1, 2019 at 6:04 PM Lavanya Thirumalaisami
 wrote:

>
> Hi,
>
> I am trying to debug a query to find out why one document gets a higher
> score than the other. Below are two similar products.
>
> Below are the debug results I get from the Solr admin console.
>
> "Doc1": "\n
> 15.20965 = sum of:\n
>   4.7573533 = max of:\n
>     4.7573533 = weight(All:2x in 962) [], result of:\n
>       4.7573533 = score(doc=962,freq=2.0 = termFreq=2.0\n), product of:\n
>         3.4598935 = idf(docFreq=1346, docCount=42836)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = termFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n
>   10.452296 = max of:\n
>     5.9166136 = weight(All:powerpoint in 962) [], result of:\n
>       5.9166136 = score(doc=962,freq=2.0 = termFreq=2.0\n), product of:\n
>         4.302992 = idf(docFreq=579, docCount=42836)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = termFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n
>     10.452296 = weight(All:\"socket outlet\" in 962) [], result of:\n
>       10.452296 = score(doc=962,freq=2.0 = phraseFreq=2.0\n), product of:\n
>         7.60167 = idf(), sum of:\n
>           3.5370626 = idf(docFreq=1246, docCount=42836)\n
>           4.064607 = idf(docFreq=735, docCount=42836)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = phraseFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n",
>
> "Doc15": "\n
> 13.258003 = sum of:\n
>   5.7317085 = max of:\n
>     5.7317085 = weight(All:doubl in 2122) [], result of:\n
>       5.7317085 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
>         4.168515 = idf(docFreq=663, docCount=42874)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = termFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n
>     4.7657394 = weight(All:2x in 2122) [], result of:\n
>       4.7657394 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
>         3.4659925 = idf(docFreq=1339, docCount=42874)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = termFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n
>     5.390302 = weight(All:2g in 2122) [], result of:\n
>       5.390302 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
>         3.9202197 = idf(docFreq=850, docCount=42874)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = termFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n
>   7.526294 = max of:\n
>     5.8597584 = weight(All:powerpoint in 2122) [], result of:\n
>       5.8597584 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
>         4.2616425 = idf(docFreq=604, docCount=42874)\n
>         1.375 = tfNorm, computed from:\n
>           2.0 = termFreq=2.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n
>     7.526294 = weight(All:\"socket outlet\" in 2122) [], result of:\n
>       7.526294 = score(doc=2122,freq=1.0 = phraseFreq=1.0\n), product of:\n
>         7.526294 = idf(), sum of:\n
>           3.4955401 = idf(docFreq=1300, docCount=42874)\n
>           4.030754 = idf(docFreq=761, docCount=42874)\n
>         1.0 = tfNorm, computed from:\n
>           1.0 = phraseFreq=1.0\n
>           1.2 = parameter k1\n
>           0.0 = parameter b (norms omitted for field)\n",
>
>
>
> My Questions
>
> 1.      IDF: I understand from the Solr documentation that IDF is
> calculated separately for each shard. I have added the following stats
> cache config to solrconfig.xml and reloaded the collection:
>
> 
>
> But even after that there is no change in the calculated IDF.
>
> 2.      What are parameter b and parameter k1?
>
> 3.      Why are there many more parameters included for Doc15 than for
> Doc1?
>
> Is there any documentation I can refer to in order to understand the Solr
> query calculations in depth?
>
> We are using Solr 6.1 in Cloud mode with 3 ZooKeepers, 3 masters and 3
> replicas.
>
> Regards,
> Lavanya
>
-- 
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
  

Debugging Solr Search results & Issues with Distributed IDF

2019-01-01 Thread Lavanya Thirumalaisami

Hi,

I am trying to debug a query to find out why one document gets a higher score
than the other. Below are two similar products.

Below are the debug results I get from the Solr admin console.

"Doc1": "\n
15.20965 = sum of:\n
  4.7573533 = max of:\n
    4.7573533 = weight(All:2x in 962) [], result of:\n
      4.7573533 = score(doc=962,freq=2.0 = termFreq=2.0\n), product of:\n
        3.4598935 = idf(docFreq=1346, docCount=42836)\n
        1.375 = tfNorm, computed from:\n
          2.0 = termFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n
  10.452296 = max of:\n
    5.9166136 = weight(All:powerpoint in 962) [], result of:\n
      5.9166136 = score(doc=962,freq=2.0 = termFreq=2.0\n), product of:\n
        4.302992 = idf(docFreq=579, docCount=42836)\n
        1.375 = tfNorm, computed from:\n
          2.0 = termFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n
    10.452296 = weight(All:\"socket outlet\" in 962) [], result of:\n
      10.452296 = score(doc=962,freq=2.0 = phraseFreq=2.0\n), product of:\n
        7.60167 = idf(), sum of:\n
          3.5370626 = idf(docFreq=1246, docCount=42836)\n
          4.064607 = idf(docFreq=735, docCount=42836)\n
        1.375 = tfNorm, computed from:\n
          2.0 = phraseFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n",

"Doc15": "\n
13.258003 = sum of:\n
  5.7317085 = max of:\n
    5.7317085 = weight(All:doubl in 2122) [], result of:\n
      5.7317085 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
        4.168515 = idf(docFreq=663, docCount=42874)\n
        1.375 = tfNorm, computed from:\n
          2.0 = termFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n
    4.7657394 = weight(All:2x in 2122) [], result of:\n
      4.7657394 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
        3.4659925 = idf(docFreq=1339, docCount=42874)\n
        1.375 = tfNorm, computed from:\n
          2.0 = termFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n
    5.390302 = weight(All:2g in 2122) [], result of:\n
      5.390302 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
        3.9202197 = idf(docFreq=850, docCount=42874)\n
        1.375 = tfNorm, computed from:\n
          2.0 = termFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n
  7.526294 = max of:\n
    5.8597584 = weight(All:powerpoint in 2122) [], result of:\n
      5.8597584 = score(doc=2122,freq=2.0 = termFreq=2.0\n), product of:\n
        4.2616425 = idf(docFreq=604, docCount=42874)\n
        1.375 = tfNorm, computed from:\n
          2.0 = termFreq=2.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n
    7.526294 = weight(All:\"socket outlet\" in 2122) [], result of:\n
      7.526294 = score(doc=2122,freq=1.0 = phraseFreq=1.0\n), product of:\n
        7.526294 = idf(), sum of:\n
          3.4955401 = idf(docFreq=1300, docCount=42874)\n
          4.030754 = idf(docFreq=761, docCount=42874)\n
        1.0 = tfNorm, computed from:\n
          1.0 = phraseFreq=1.0\n
          1.2 = parameter k1\n
          0.0 = parameter b (norms omitted for field)\n",
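
The way the two totals above are composed can be checked with a short Python sketch. Each top-level score is a sum of "max of" groups, and each group contributes only its best-scoring clause. The list layout and the name `sum_of_max` are illustrative; the numbers are copied from the explanations above.

```python
def sum_of_max(groups):
    """Top-level score: add up the best clause score from every group."""
    return sum(max(clauses) for clauses in groups)


doc1 = [
    [4.7573533],             # All:2x
    [5.9166136, 10.452296],  # All:powerpoint, All:"socket outlet"
]
doc15 = [
    [5.7317085, 4.7657394, 5.390302],  # All:doubl, All:2x, All:2g
    [5.8597584, 7.526294],             # All:powerpoint, All:"socket outlet"
]

print(round(sum_of_max(doc1), 4))
print(round(sum_of_max(doc15), 4))
```

Read this way, Doc15 shows more lines because three terms matched in its first group (doubl, 2x, 2g) where Doc1 matched only one (2x). Only the largest of the three counts toward the total. Doc1 still ranks higher because its "socket outlet" phrase has phraseFreq=2.0 against 1.0 for Doc15.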

 

My Questions 

1.  IDF: I understand from the Solr documentation that IDF is calculated
separately for each shard. I have added the following stats cache config to
solrconfig.xml and reloaded the collection:

 

But even after that there is no change in the calculated IDF.

2.  What are parameter b and parameter k1?

3.  Why are there many more parameters included for Doc15 than for Doc1?

Is there any documentation I can refer to in order to understand the Solr
query calculations in depth?

We are using Solr 6.1 in Cloud mode with 3 ZooKeepers, 3 masters and 3
replicas.

Regards,
Lavanya