Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Thank you very much, Chris. I was not aware of debug.explain.structured. It seems to be what I was looking for.

Thanks also to Jack Krupansky. Yes, delving into those numbers would be my next step, but I will get to that later.

O. O.

Chris Hostetter-3 wrote
> Just to be clear, regardless of *which* response writer you use (xml,
> ruby, json, etc.), the default behavior is to include the score
> explanation as a single string which uses tabs/newlines to deal with the
> nesting (this nesting is visible if you view the raw response, no matter
> which ResponseWriter you use).
>
> You can, however, add a param indicating that you want the explanation
> information to be returned as *structured data* instead of a simple
> string:
>
> https://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured
>
> If you want to programmatically process debug info, this is the
> recommended way to do so.
>
> -Hoss
> http://www.lucidworks.com/

--
View this message in context: http://lucene.472066.n3.nabble.com/Understanding-the-Debug-explanations-for-Query-Result-Scoring-Ranking-tp4149137p4149521.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
: Thank you very much Erik. This is exactly what I was looking for. While at
: the moment I have no clue about these numbers, the ruby formatting makes it
: much easier to understand.

Just to be clear, regardless of *which* response writer you use (xml, ruby, json, etc.), the default behavior is to include the score explanation as a single string which uses tabs/newlines to deal with the nesting (this nesting is visible if you view the raw response, no matter which ResponseWriter you use).

You can, however, add a param indicating that you want the explanation information to be returned as *structured data* instead of a simple string:

https://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured

If you want to programmatically process debug info, this is the recommended way to do so.

-Hoss
http://www.lucidworks.com/
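For anyone who wants to process the structured form, here is a minimal sketch of walking one structured explain entry and printing it as an indented tree. It assumes each node is a map with "value" and "description" keys plus an optional "details" list of child nodes; the function name render_explain and the hand-written sample are illustrative only, so check the key names against your own response before relying on them.

```python
def render_explain(node, depth=0):
    """Return indented text lines for one structured explain node.

    Assumption: a node is a dict with "value" and "description" keys and an
    optional "details" list holding child nodes of the same shape.
    """
    lines = ["  " * depth + f"{node['value']} = {node['description']}"]
    for child in node.get("details", []):
        # Children are rendered one indentation level deeper than the parent.
        lines.extend(render_explain(child, depth + 1))
    return lines


# Hand-written sample shaped like one sub-tree of the explain output
# discussed in this thread (not fetched from a live Solr instance).
sample = {
    "match": True,
    "value": 0.71447384,
    "description": "queryWeight, product of:",
    "details": [
        {"match": True, "value": 7.0424104,
         "description": "idf(docFreq=896, maxDocs=377553)"},
        {"match": True, "value": 0.10145303, "description": "queryNorm"},
    ],
}

print("\n".join(render_explain(sample)))
```

In practice the input would come from the "explain" section of a JSON response requested with debug.explain.structured=true, one entry per document.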
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
The formatting is one thing, but ultimately it is just a giant expression, one for each document. The expression computes the score, based on your chosen or default "similarity" algorithm. All the terms in the expressions are detailed here:

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Unless you dive into that math (not so bad, really, if you are motivated), the expressions are going to be rather opaque to you. The long floating point numbers are mostly just the intermediate (and final) calculations of the math described above.

Try constructing a very simple collection of simple, contrived documents, like a short sentence in each, with some common terms, and then try simple queries to see how the expression term values change. Try computing TF, DF, and IDF yourself (just count the terms by hand), and compare to what debug gives you.

-- Jack Krupansky

-----Original Message-----
From: O. Olson
Sent: Thursday, July 24, 2014 6:45 PM
To: solr-user@lucene.apache.org
Subject: Understanding the Debug explanations for Query Result Scoring/Ranking

I'm curious if there is some documentation on understanding these numbers/results. [...]
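Jack's suggestion to compute TF and IDF by hand can be sketched in a few lines. This assumes the index uses DefaultSimilarity (as the "[DefaultSimilarity]" tag in the explain output indicates), where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)), as described on the TFIDFSimilarity page linked above. The function names are illustrative, and the inputs are the counts reported in the explain output from the original post.

```python
import math


def tf(freq):
    # DefaultSimilarity: square root of the raw term frequency.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # DefaultSimilarity: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Counts taken from the explain output in the original post.
print(tf(6.0))             # compare with "2.4494898 = tf(freq=6.0)"
print(idf(896, 377553))    # compare with "7.0424104 = idf(docFreq=896, ...)"
print(idf(1037, 377553))   # compare with "6.896415 = idf(docFreq=1037, ...)"
```

Lucene computes these in single-precision floats, so the last digits may differ slightly from what Python prints.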
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Thank you very much, Erik. This is exactly what I was looking for. While at the moment I have no clue about these numbers, the ruby formatting makes it much easier to understand.

Thanks to you too, Koji. I'm sorry I did not acknowledge you before. I think Erik's solution is what I was looking for.

O. O.

Erik Hatcher-4 wrote
> The format of the XML explain output is not indented or very readable.
> When I really need to see the explain indented, I use wt=ruby&indent=true
> (I don’t think the indent parameter is relevant for the explain output,
> but I use it anyway)
>
> Erik
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
The format of the XML explain output is not indented or very readable. When I really need to see the explain indented, I use wt=ruby&indent=true (I don’t think the indent parameter is relevant for the explain output, but I use it anyway).

Erik

On Jul 25, 2014, at 10:11 AM, O. Olson wrote:

> Thank you Uwe. Unfortunately, I could not get your explain solr website to
> work. I always get an error saying "Ops. We have internal server error. This
> event was logged. We will try fix this soon. We are sorry for
> inconvenience."
>
> At this point, I know that I need to have some technical background to
> understand how these numbers are calculated. However, even with that, I am
> sure that the format of this output is not obvious. I am curious about the
> documentation of this output format. It seems to be unintelligible.
>
> If this is not documented anywhere, can someone point me to which class is
> doing this output?
>
> Thank you,
> O. O.
>
> an6 wrote
>> Hi,
>>
>> to get an idea of the meaning of all these numbers, have a look at
>> http://explain.solr.pl. I like this tool, it's great.
>>
>> Uwe
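As a small illustration of Erik's tip, the request parameters can be assembled like this. The host, port, and core name in BASE_URL are hypothetical placeholders, and debug_query_url is just an illustrative helper; only the debug, wt, and indent parameters come from this thread.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; replace with your own host and core.
BASE_URL = "http://localhost:8983/solr/collection1/select"


def debug_query_url(query):
    # debug=true adds the "debug" section (including "explain");
    # wt=ruby with indent=true is the combination Erik uses for readability.
    params = {"q": query, "debug": "true", "wt": "ruby", "indent": "true"}
    return BASE_URL + "?" + urlencode(params)


print(debug_query_url("televisions"))
```

The printed URL can be pasted into a browser or passed to any HTTP client.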
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Thank you, Uwe. Unfortunately, I could not get your explain solr website to work. I always get an error saying "Ops. We have internal server error. This event was logged. We will try fix this soon. We are sorry for inconvenience."

At this point, I know that I need to have some technical background to understand how these numbers are calculated. However, even with that, I am sure that the format of this output is not obvious. I am curious about the documentation of this output format. It seems to be unintelligible.

If this is not documented anywhere, can someone point me to which class is doing this output?

Thank you,
O. O.

an6 wrote
> Hi,
>
> to get an idea of the meaning of all these numbers, have a look at
> http://explain.solr.pl. I like this tool, it's great.
>
> Uwe
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Hi,

In addition, this might be useful:

Fundamentals of Information Retrieval, Illustration with Apache Lucene
https://www.youtube.com/watch?v=SCsS5ePGmCs

This video is about 40 minutes long, but you can fast forward to 24:00 to learn about scoring based on the vector space model and how Lucene customizes it.

Koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html

(2014/07/25 8:00), Uwe Reh wrote:
> Hi,
>
> to get an idea of the meaning of all these numbers, have a look at
> http://explain.solr.pl. I like this tool, it's great.
>
> Uwe
>
> On 25.07.2014 00:45, O. Olson wrote:
>> I'm curious if there is some documentation on understanding these
>> numbers/results. [...]
Re: Understanding the Debug explanations for Query Result Scoring/Ranking
Hi,

to get an idea of the meaning of all these numbers, have a look at http://explain.solr.pl. I like this tool, it's great.

Uwe

On 25.07.2014 00:45, O. Olson wrote:
> I'm curious if there is some documentation on understanding these
> numbers/results. [...]
Understanding the Debug explanations for Query Result Scoring/Ranking
Hi,

If you add &debug=true to the Solr request (and &wt=xml if your current output is not XML), you get a node named "debug" in the resulting XML. It has a child node called "explain", which contains a list showing why the results are ranked in a particular order. I'm curious if there is some documentation on understanding these numbers/results.

I am new to Solr, so I apologize that I may be using the wrong terms to describe my problem. I am also aware of http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html though I have not completely understood it.

My problem is trying to understand something like this:

1.5797625 = (MATCH) sum of:
  0.4717142 = (MATCH) weight(text:televis in 44109) [DefaultSimilarity], result of:
    0.4717142 = score(doc=44109,freq=1.0 = termFreq=1.0 ), product of:
      0.71447384 = queryWeight, product of:
        7.0424104 = idf(docFreq=896, maxDocs=377553)
        0.10145303 = queryNorm
      0.660226 = fieldWeight in 44109, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        7.0424104 = idf(docFreq=896, maxDocs=377553)
        0.09375 = fieldNorm(doc=44109)
  1.1080483 = (MATCH) weight(text:tv in 44109) [DefaultSimilarity], result of:
    1.1080483 = score(doc=44109,freq=6.0 = termFreq=6.0 ), product of:
      0.6996622 = queryWeight, product of:
        6.896415 = idf(docFreq=1037, maxDocs=377553)
        0.10145303 = queryNorm
      1.5836904 = fieldWeight in 44109, product of:
        2.4494898 = tf(freq=6.0), with freq of:
          6.0 = termFreq=6.0
        6.896415 = idf(docFreq=1037, maxDocs=377553)
        0.09375 = fieldNorm(doc=44109)

*Note:* I have searched for "televisions". My search field is a single catch-all field. The Edismax parser seems to break up my search term into "televis" and "tv".

Is there some documentation on how to understand these numbers? They do not seem to be properly delimited. At the minimum, I can understand something like:

1.5797625 = 0.4717142 + 1.1080483
and
0.71447384 = 7.0424104 * 0.10145303

But I cannot understand if something like "0.10145303 = queryNorm 0.660226 = fieldWeight in 44109" is used in the calculation anywhere. Also, since there were only two terms ("televis" and "tv"), I could use subtraction to find out that 1.1080483 was the start of a new result.

I'd also appreciate it if someone could tell me which class dumps out the above data. If I know it, I can edit that class to make the output a bit more understandable for me.

Thank you,
O. O.
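The arithmetic in the explain output from the original post can be checked bottom-up. The sketch below reads the nesting as follows: each "product of:" line is the product of the lines nested under it, and the top line is the sum of the two term scores. On that reading, queryNorm is one factor of queryWeight, and fieldWeight is the second factor of each term's score. The function name term_score is illustrative; the numbers are copied from the post.

```python
def term_score(idf, query_norm, tf, field_norm):
    query_weight = idf * query_norm        # "queryWeight, product of:"
    field_weight = tf * idf * field_norm   # "fieldWeight in 44109, product of:"
    return query_weight * field_weight     # "score(doc=44109,...), product of:"


QUERY_NORM = 0.10145303   # same for both terms of the query
FIELD_NORM = 0.09375      # fieldNorm(doc=44109)

televis = term_score(7.0424104, QUERY_NORM, 1.0, FIELD_NORM)
tv = term_score(6.896415, QUERY_NORM, 2.4494898, FIELD_NORM)

print(televis)        # compare with 0.4717142
print(tv)             # compare with 1.1080483
print(televis + tv)   # compare with 1.5797625, the "sum of:" line
```

Small differences in the last digits are expected, since Lucene works in single-precision floats.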