RE: Why do documents without the search query term rank highest
WOW! Thanks Chris - I have read your feedback but I will need to go through it a couple more times to get my head around it :) - thanks for taking the time to help - much appreciated!

Thanks
Stuart
PMP, Business Technical Analyst | CRS Consultant | Corporate IT Digital | McDonald's Corporation
2111 McDonald's Drive | Oak Brook, IL 60523 USA
Office: +1 630.623.5950 | Cell: 301.633.3298 | stuart.scot...@us.mcd.com

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tuesday, December 01, 2015 3:52 PM
To: solr-user@lucene.apache.org
Subject: RE: Why do documents without the search query term rank highest
RE: Why do documents without the search query term rank highest
: Again, my confusion is why the document 'Home' appears ahead of the
: document 'Big Mac' in the ranking when the query term 'big' only appears
: once in 'Home' but several times in 'Big Mac'?

The key to understanding how documents are scored is in the query structure and the "explain" output. By default the explain output is a simple string using newlines & whitespace indenting for formatting -- something that got lost when you pasted it into email -- but i've tried to reformat it below based on educated guesses and lots of experience.

(FWIW: adding debug.explain.structured=true will use the xml/json/whatever response format for structure instead of newlines + indenting)

http://www-a4.staging.mcdonalds.com/us/en/home.html

0.027089478 = (MATCH) product of:
  0.18962634 = (MATCH) sum of:
    0.18962634 = (MATCH) weight(keywords:big in 78) [DefaultSimilarity]
      0.18962634 = score(doc=78,freq=1.0 = termFreq=1.0), product of:
        0.3345638 = queryWeight, product of:
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.06456205 = queryNorm
        0.56678677 = fieldWeight in 78, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.109375 = fieldNorm(doc=78)
  0.14285715 = coord(1/7)

So what the above tells us, is that the top scoring document (home.html) matched a single clause of the query which was "keywords:big". The *term* "keywords:big" appeared 1 time (freq=1.0) in this document, and is in a total of 3 documents (docFreq).

(note that *term* is key here -- the number of times the *word* big appears in all fields doesn't matter for score calculations, just that it appears in the "keywords" field for a total of 3 documents, and this is one of them)

There were "penalties" to the score for this document based on the "fieldNorm" of the keywords field (which comes from index time document & field boosts, as well as field length at index time) and because it only matched 1/7 of the clauses of the query.
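To make the arithmetic in that explain output concrete, here is a minimal Python sketch that multiplies the same factors back together. The numbers are copied from the home.html explanation above; the function and variable names are hypothetical, and tf is taken as sqrt(termFreq), which is what DefaultSimilarity uses. Treat it as an illustration of how the pieces combine, not as Lucene's actual code.

```python
import math

def clause_score(term_freq, idf, query_norm, field_norm):
    """Score of one matching clause: queryWeight * fieldWeight."""
    tf = math.sqrt(term_freq)             # DefaultSimilarity: tf = sqrt(freq)
    query_weight = idf * query_norm       # 0.3345638 in the explain output
    field_weight = tf * idf * field_norm  # 0.56678677 in the explain output
    return query_weight * field_weight

# Factors copied from the explain output for home.html (doc 78).
keywords_big = clause_score(
    term_freq=1.0,
    idf=5.18205,
    query_norm=0.06456205,
    field_norm=0.109375,
)

# coord(1/7): only 1 of the 7 query clauses matched this document.
coord = 1 / 7
home_score = coord * keywords_big

print(f"clause score: {keywords_big:.8f}")
print(f"final score:  {home_score:.9f}")
```

The final value should land very close to the 0.027089478 reported for home.html; tiny differences are expected because Lucene does this math in single-precision floats. Changing field_norm in the call is a quick way to see how strongly that one factor moves the result.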
Now let's compare with the second match

http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html

0.0075755017 = (MATCH) product of:
  0.026514255 = (MATCH) sum of:
    0.0146626085 = (MATCH) weight(description:big in 104) [DefaultSimilarity]
      0.0146626085 = score(doc=104,freq=3.0 = termFreq=3.0), product of:
        0.3345638 = queryWeight, product of:
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.06456205 = queryNorm
        0.043826047 = fieldWeight in 104, product of:
          1.7320508 = tf(freq=3.0), with freq of:
            3.0 = termFreq=3.0
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.0048828125 = fieldNorm(doc=104)
    0.011851646 = (MATCH) weight(title:big in 104) [DefaultSimilarity]
      0.011851646 = score(doc=104,freq=1.0 = termFreq=1.0), product of:
        0.3345638 = queryWeight, product of:
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.06456205 = queryNorm
        0.035424173 = fieldWeight in 104, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.0068359375 = fieldNorm(doc=104)
  0.2857143 = coord(2/7)

In this case, the document matches two clauses of the query -- "description:big" and "title:big". The term description:big is matched 3 times (termFreq) in this document, and evidently exists in only 3 documents in the index (docFreq) but the fieldNorm is penalizing the overall scores. Likewise the term title:big is matched 1 time, and exists in only 3 documents in your index -- the fieldNorm is slightly higher (probably due to the shorter length of the title). The overall score of the second doc is penalized for only matching 2 of the 7 clauses.

Based on what i'm seeing here, the biggest surprise i have is the fieldNorm values you are getting -- they don't make sense given the lengths of the fields you showed us in the output unless some index time document (or field) boosts are getting applied -- perhaps intended to "promote" the "home.html" page in your search results?

My guess is some setting in your CMS is doing this?
maybe based on "page depth" or something like that?

Based on your configs, I'm guessing you're running Solr 4.2 -- So I tried loading up copies of those 2 documents using the config+schema you provided, and here are the score explanations i got...

**NOTE** Things like the docFreqs (and therefore queryWeight & fieldWeight) are NOT going to be comparable because my index *only* had those two documents ... the key here is to compare the fieldNorms below with the fieldNorms from the same documents in your query...

http://www-a4.staging.mcdonalds.com/us/en/home.html

0.004108005 = (MATCH) product of:
  0.028756034 = (MATCH) sum of:
    0.028756034 = (MATCH) weight(keywords:big in 0) [DefaultSimilarity],
      0.028756034 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
        0.2629123 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.2629123
RE: Why do documents without the search query term rank highest
Thank you!

Thanks
Stuart

-----Original Message-----
From: Upayavira [mailto:u...@odoko.co.uk]
Sent: Tuesday, December 01, 2015 10:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Why do documents without the search query term rank highest

The information contained in this e-mail and any accompanying documents is confidential, may be privileged, and is intended solely for the person and/or entity to whom it is addressed (i.e. those identified in the "To" and "cc" box). They are the property of McDonald's Corporation. Unauthorized review, use, disclosure, or copying of this communication, or any part thereof, is strictly prohibited and may be unlawful. If you have received this e-mail in error, please return the e-mail and attachments to the sender and delete the e-mail and attachments and any copy from your system. McDonald's thanks you for your cooperation.
Re: Why do documents without the search query term rank highest
: I would suggest you ask on a forum related to Adobe CQ. There are many
: ways in which CQ could be issuing queries against Solr, and without
: insight into that, people here aren't that likely to be able to help you
: - unless they happen to also use CQ, which probably amounts to a very
: small portion of this community.

If you have access to the Solr logs, and can provide us with the configs, schema, and requests being made by your frontend, then folks might be able to help explain the results.

In particular, if you can figure out what query the front end is making, then make that same query with "debug=true" added to the request, and provide that entire output here, then that will help explain everything about the queries being executed and why the results are getting the scores they have...

https://wiki.apache.org/solr/UsingMailingLists

: > For example, searching for 'big' bring back 'Home' as the top result
: > and 'Big Mac as the second result - see here
: > http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd

FWIW: That URL doesn't do a search for "big" ... pretty sure you meant...

http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=big=usmcd

-Hoss
http://www.lucidworks.com/
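As a rough illustration of the "debug=true" suggestion above, this Python sketch builds such a request URL. The host, core name ("collection1"), handler path and the query string "big" are assumptions for illustration only; the real values have to come from whatever request the CQ front end actually sends (the Solr logs should show it). debug=true and debug.explain.structured=true are the parameters mentioned in this thread; q, fl and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode

def build_debug_url(select_url, query):
    """Return a Solr select URL that repeats `query` with debugging turned on."""
    params = {
        "q": query,                          # the query the front end really sends
        "fl": "*,score",                     # include each document's score
        "wt": "json",                        # response format
        "debug": "true",                     # adds parsed query + explain output
        "debug.explain.structured": "true",  # explain as nested structure, not text
    }
    return select_url + "?" + urlencode(params)

# Hypothetical host and core name; substitute your own.
url = build_debug_url("http://localhost:8983/solr/collection1/select", "big")
print(url)
```

Pasting the printed URL into a browser (or curl) against the real Solr instance, and posting the whole response, gives the list the parsed query and the per-document explanations in one go.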
Re: Why do documents without the search query term rank highest
I would suggest you ask on a forum related to Adobe CQ. There are many ways in which CQ could be issuing queries against Solr, and without insight into that, people here aren't that likely to be able to help you - unless they happen to also use CQ, which probably amounts to a very small portion of this community.

Upayavira

On Tue, Dec 1, 2015, at 04:36 PM, Scotten Stuart wrote:
> Hi All,
>
> I hope this is the way to ask a question - please guide me if there is a
> different protocol
>
> I have a question about results ranking for Solr V4.2 in combination with
> the CMS tool Adobe CQ (V5.6).
>
> Despite trying different ways to configure the ranking of documents I am
> confused why content that does not have even one mention of the search
> query ranks higher than documents that are actually titled with the
> search query.
>
> For example, searching for 'big' bring back 'Home' as the top result and
> 'Big Mac as the second result - see here
> http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd
>
> Any thoughts would be very welcome
>
> Thanks
> Stuart
RE: Why do documents without the search query term rank highest
Thank you for the feedback - I will need some time to put together the response to your suggestions - and you're right, I did get the search URL wrong - just a beginner at this!

Thanks
Stuart

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tuesday, December 01, 2015 10:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Why do documents without the search query term rank highest
RE: Why do documents without the search query term rank highest
s=262)
          0.06456205 = queryNorm
        0.0026707677 = fieldWeight in 66, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.4697323 = idf(docFreq=2, maxDocs=262)
          4.8828125E-4 = fieldNorm(doc=66)
    0.0016930922 = (MATCH) weight(title:big in 66) [DefaultSimilarity], result of:
      0.0016930922 = score(doc=66,freq=1.0 = termFreq=1.0), product of:
        0.3345638 = queryWeight, product of:
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.06456205 = queryNorm
        0.005060596 = fieldWeight in 66, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.18205 = idf(docFreq=3, maxDocs=262)
          9.765625E-4 = fieldNorm(doc=66)
  0.42857143 = coord(3/7)

http://www-a4.staging.mcdonalds.com/us/en/food/food_quality/see_what_we_are_made_of/meet_our_suppliers/keystone_foods.html

8.465462E-4 = (MATCH) product of:
  0.005925823 = (MATCH) sum of:
    0.005925823 = (MATCH) weight(keywords:big in 28) [DefaultSimilarity], result of:
      0.005925823 = score(doc=28,freq=1.0 = termFreq=1.0), product of:
        0.3345638 = queryWeight, product of:
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.06456205 = queryNorm
        0.017712086 = fieldWeight in 28, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.18205 = idf(docFreq=3, maxDocs=262)
          0.0034179688 = fieldNorm(doc=28)
  0.14285715 = coord(1/7)

LuceneQParser 4.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0

Thanks
Stuart

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tuesday, December 01, 2015 10:54 AM
To: solr-user@lucene.apache.org
Subject: Re: Why do documents without the search query term rank highest

config.docx
Description: config.docx

schema.docx
Description: schema.docx
Why do documents without the search query term rank highest
Hi All,

I hope this is the way to ask a question - please guide me if there is a different protocol.

I have a question about results ranking for Solr V4.2 in combination with the CMS tool Adobe CQ (V5.6).

Despite trying different ways to configure the ranking of documents, I am confused why content that does not have even one mention of the search query ranks higher than documents that are actually titled with the search query.

For example, searching for 'big' brings back 'Home' as the top result and 'Big Mac' as the second result - see here
http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd

Any thoughts would be very welcome

Thanks
Stuart