RE: Why do documents without the search query term rank highest

2015-12-01 Thread Scotten Stuart
WOW!

Thanks Chris - I have read your feedback but I will need to go through it a 
couple more times to get my head around it :) - thanks for taking the time to 
help - much appreciated!



Thanks
Stuart
PMP, Business Technical Analyst | CRS Consultant | Corporate IT Digital | 
McDonald's Corporation
2111 McDonald's Drive | Oak Brook, IL 60523 USA
Office: +1 630.623.5950 | Cell: 301.633.3298 | stuart.scot...@us.mcd.com





RE: Why do documents without the search query term rank highest

2015-12-01 Thread Chris Hostetter

: Again, my confusion is why the document 'Home' appears ahead of the 
: document 'Big Mac' in the ranking when the query term 'big' only appears 
: once in 'Home' but several times in 'Big Mac'?

The key to understanding how documents are scored is in the query 
structure and the "explain" output.

By default the explain output is a simple string using newlines & 
whitespace indenting for formatting -- something that got lost when you 
pasted it into email -- but i've tried to reformat it below based on 
educated guesses and lots of experience. (FWIW: adding 
debug.explain.structured=true will use the xml/json/whatever response 
format for structure instead of newlines + indenting)
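A minimal sketch of how such a request could be put together. The host, the core name (collection1) and the /select handler path are assumptions for illustration and do not come from this thread; only the debug and debug.explain.structured parameters are the ones mentioned above.

```python
from urllib.parse import urlencode

def build_debug_url(base_url, query):
    """Return a Solr select URL that asks for structured explain output."""
    params = {
        "q": query,
        "wt": "json",
        "debug": "true",                     # include score explanations in the response
        "debug.explain.structured": "true",  # nested structure instead of an indented string
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint; replace with the real Solr host and core.
url = build_debug_url("http://localhost:8983/solr/collection1/select", "big")
print(url)
```

With the structured form, the nesting survives copy and paste into email because it no longer depends on whitespace.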

http://www-a4.staging.mcdonalds.com/us/en/home.html

0.027089478 = (MATCH) product of:
.0.18962634 = (MATCH) sum of:
..0.18962634 = (MATCH) weight(keywords:big in 78) [DefaultSimilarity]
...0.18962634 = score(doc=78,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.56678677 = fieldWeight in 78, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.109375 = fieldNorm(doc=78)
.0.14285715 = coord(1/7)

So what the above tells us, is that the top scoring document (home.html) 
matched a single clause of the query which was "keywords:big".  The *term* 
"keywords:big" appeared 1 time (freq=1.0) in this document, and is in a 
total of 3 documents (docFreq). 

(note that *term* is key here -- the number of times the *word* big 
appears in all fields doesn't matter for score calculations, just that it 
appears in the "keywords" field for a total of 3 documents, and this is 
one of them)

There were "penalties" to the score for this document based on the 
"fieldNorm" of the keywords field (which comes from index time document & 
field boosts, as well as field length at index time) and because it only 
matched 1/7 of the clauses of the query.
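The arithmetic in that explain output can be reproduced by hand. The sketch below assumes the classic Lucene TF-IDF formulas used by DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The queryNorm, fieldNorm and coord values are copied from the explain output rather than derived, and the function names are mine.

```python
import math

def tf(freq):
    # term frequency factor: square root of the raw count
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # inverse document frequency: rarer terms get a larger value
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    query_weight = idf(doc_freq, max_docs) * query_norm
    field_weight = tf(freq) * idf(doc_freq, max_docs) * field_norm
    return query_weight * field_weight

QUERY_NORM = 0.06456205   # copied from the explain output
FIELD_NORM = 0.109375     # fieldNorm(doc=78) for the keywords field
COORD = 1 / 7             # 1 of 7 query clauses matched

home = term_score(1.0, 3, 262, QUERY_NORM, FIELD_NORM) * COORD
print(home)
```

The result should land very close to the 0.027089478 reported for home.html; small differences come from Lucene doing this math in 32-bit floats.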

Now let's compare with the second match:

http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html

0.0075755017 = (MATCH) product of:
.0.026514255 = (MATCH) sum of:
..0.0146626085 = (MATCH) weight(description:big in 104) [DefaultSimilarity]
...0.0146626085 = score(doc=104,freq=3.0 = termFreq=3.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.043826047 = fieldWeight in 104, product of:
.....1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0048828125 = fieldNorm(doc=104)
..0.011851646 = (MATCH) weight(title:big in 104) [DefaultSimilarity]
...0.011851646 = score(doc=104,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.035424173 = fieldWeight in 104, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0068359375 = fieldNorm(doc=104)
.0.2857143 = coord(2/7)

In this case, the document matches two clauses of the query -- 
"description:big" and "title:big".  The term description:big is matched 3 
times (termFreq) in this document, and evidently exists in only 3 
documents in the index (docFreq) but the fieldNorm is penalizing the 
overall scores.  Likewise the term title:big is matched 1 time, and exists 
in only 3 documents in your index -- the fieldNorm is slightly higher 
(probably due to the shorter length of the title).  The overall score of 
the second doc is penalized for only matching 2 of the 7 clauses.
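To see how much of the gap between the two documents comes from fieldNorm alone, here is a rough comparison using the numbers from the two explain outputs. The last calculation is purely hypothetical: it pretends the Big Mac fields had the same fieldNorm as the keywords field of home.html. The helper names are mine.

```python
import math

QUERY_WEIGHT = 0.3345638  # idf * queryNorm, the same for all three matching clauses
IDF = 5.18205             # idf(docFreq=3, maxDocs=262)

def clause(freq, field_norm):
    # queryWeight * fieldWeight for one matching clause
    return QUERY_WEIGHT * math.sqrt(freq) * IDF * field_norm

def doc_score(clauses, matched, total=7):
    # sum of the matching clauses, scaled by coord(matched/total)
    return sum(clause(f, n) for f, n in clauses) * matched / total

home = doc_score([(1.0, 0.109375)], matched=1)
big_mac = doc_score([(3.0, 0.0048828125), (1.0, 0.0068359375)], matched=2)

# Hypothetical: same fieldNorm as home.html's keywords field
big_mac_equal_norms = doc_score([(3.0, 0.109375), (1.0, 0.109375)], matched=2)

print(home, big_mac, big_mac_equal_norms)
```

With the real fieldNorms home.html wins; with equal fieldNorms the Big Mac page would score several times higher, since it has more matches and a better coord.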

Based on what i'm seeing here, the biggest surprise i have is the fieldNorm 
values you are getting -- they don't make sense given the lengths of the 
fields you showed us in the output unless some index time document (or 
field) boosts are getting applied -- perhaps intended to "promote" the 
"home.html" page in your search results?  My guess is that some setting in 
your CMS is doing this, maybe based on "page depth" or something like 
that?
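A quick way to check whether a fieldNorm is plausible: with DefaultSimilarity the norm is roughly boost / sqrt(number of terms in the field), stored in a lossy one-byte encoding. Inverting that gives only an approximate implied field length, and the sketch below assumes an index-time boost of 1.0 unless one is passed in. The function name is mine.

```python
def implied_term_count(field_norm, boost=1.0):
    """Invert norm ~= boost / sqrt(num_terms). Approximate: norms are stored lossily."""
    return (boost / field_norm) ** 2

# fieldNorm values copied from the two explain outputs above
norms = {
    "home.html keywords": 0.109375,
    "BigMac.html description": 0.0048828125,
    "BigMac.html title": 0.0068359375,
}
implied = {name: implied_term_count(norm) for name, norm in norms.items()}
for name, count in implied.items():
    print(f"{name}: ~{count:.0f} terms if the boost were 1.0")
```

A title of tens of thousands of terms is not believable, which fits the suspicion that something other than field length (such as an index-time boost) is influencing these norms.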

Based on your configs, I'm guessing you're running Solr 4.2 -- So I tried 
loading up copies of those 2 documents using the config+schema you 
provided, and here are the score explanations i got...

**NOTE** Things like the docFreqs (and therefore queryWeight & 
fieldWeight) are NOT going to be comparable because my index *only* had 
those two documents ... the key here is to compare the fieldNorms below 
with the fieldNorms from the same documents in your query...


http://www-a4.staging.mcdonalds.com/us/en/home.html
0.004108005 = (MATCH) product of:
.0.028756034 = (MATCH) sum of:
..0.028756034 = (MATCH) weight(keywords:big in 0) [DefaultSimilarity],
...0.028756034 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 

RE: Why do documents without the search query term rank highest

2015-12-01 Thread Scotten Stuart
Thank you!


Thanks
Stuart




-Original Message-
From: Upayavira [mailto:u...@odoko.co.uk]
Sent: Tuesday, December 01, 2015 10:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Why do documents without the search query term rank highest

I would suggest you ask on a forum related to Adobe CQ. There are many ways in 
which CQ could be issuing queries against Solr, and without insight into that, 
people here aren't that likely to be able to help you - unless they happen to 
also use CQ, which probably amounts to a very small portion of this community.

Upayavira

On Tue, Dec 1, 2015, at 04:36 PM, Scotten Stuart wrote:
> Hi All,
>
> I hope this is the way to ask a question - please guide me if there is
> a different protocol
>
> I have a question about results ranking for Solr V4.2 in combination
> with the CMS tool Adobe CQ (V5.6).
>
> Despite trying different ways to configure the ranking of documents I
> am confused why content that does not have even one mention of the
> search query ranks higher than documents that are actually titled with
> the search query.
>
> For example, searching for 'big' brings back 'Home' as the top result
> and 'Big Mac' as the second result - see here
> http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd
>
>
> Any thoughts would be very welcome
>
> Thanks
> Stuart
>
>
>
> 
>
> The information contained in this e-mail and any accompanying
> documents is confidential, may be privileged, and is intended solely
> for the person and/or entity to whom it is addressed (i.e. those identified 
> in the "To"
> and "cc" box). They are the property of McDonald's Corporation.
> Unauthorized review, use, disclosure, or copying of this
> communication, or any part thereof, is strictly prohibited and may be
> unlawful. If you have received this e-mail in error, please return the
> e-mail and attachments to the sender and delete the e-mail and
> attachments and any copy from your system. McDonald's thanks you for your 
> cooperation.





Re: Why do documents without the search query term rank highest

2015-12-01 Thread Chris Hostetter

: I would suggest you ask on a forum related to Adobe CQ. There are many
: ways in which CQ could be issuing queries against Solr, and without
: insight into that, people here aren't that likely to be able to help you
: - unless they happen to also use CQ, which probably amounts to a very
: small portion of this community.

If you have access to the Solr logs, and can provide us with the 
configs, schema, and requests being made by your frontend, then folks 
might be able to help explain the results.

In particular, if you can figure out what query the front end is making, 
then make that same query with "debug=true" added to the request, and 
provide that entire output here, then that will help explain everything 
about the queries being executed and why the results are getting the 
scores they have...
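As a rough illustration of "make that same query with debug=true added", the helper below takes a request URL (for example one copied from the Solr logs) and appends the parameter. The logged URL shown is hypothetical, not one from this thread, and the function name is mine.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_debug(url):
    """Return the same request URL with debug=true set exactly once."""
    parts = urlsplit(url)
    # keep every existing parameter except a previous debug setting
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "debug"]
    params.append(("debug", "true"))
    return urlunsplit(parts._replace(query=urlencode(params)))

# Hypothetical request as it might appear in a Solr log
logged = "http://localhost:8983/solr/collection1/select?q=big&rows=10"
debug_url = with_debug(logged)
print(debug_url)
```

Leaving every other parameter untouched matters here: the point is to see the scores for exactly the query the front end sends.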

https://wiki.apache.org/solr/UsingMailingLists


: > For example, searching for 'big' brings back 'Home' as the top result 
: > and 'Big Mac' as the second result - see here 
: > http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd

FWIW: That URL doesn't do a search for "big" ... pretty sure you meant...

http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=big=usmcd



-Hoss
http://www.lucidworks.com/


Re: Why do documents without the search query term rank highest

2015-12-01 Thread Upayavira
I would suggest you ask on a forum related to Adobe CQ. There are many
ways in which CQ could be issuing queries against Solr, and without
insight into that, people here aren't that likely to be able to help you
- unless they happen to also use CQ, which probably amounts to a very
small portion of this community.

Upayavira

On Tue, Dec 1, 2015, at 04:36 PM, Scotten Stuart wrote:
> Hi All,
> 
> I hope this is the way to ask a question - please guide me if there is a
> different protocol
> 
> I have a question about results ranking for Solr V4.2 in combination with
> the CMS tool Adobe CQ (V5.6).
> 
> Despite trying different ways to configure the ranking of documents I am
> confused why content that does not have even one mention of the search
> query ranks higher than documents that are actually titled with the
> search query.
> 
> For example, searching for 'big' brings back 'Home' as the top result and
> 'Big Mac' as the second result - see here
> http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd
> 
> 
> Any thoughts would be very welcome
> 
> Thanks
> Stuart


RE: Why do documents without the search query term rank highest

2015-12-01 Thread Scotten Stuart
Thank you for the feedback - I will need some time to put together the response 
to your suggestions - and you're right, I did get the search URL wrong - just a 
beginner at this!


Thanks
Stuart






RE: Why do documents without the search query term rank highest

2015-12-01 Thread Scotten Stuart
s=262)
.....0.06456205 = queryNorm
....0.0026707677 = fieldWeight in 66, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.4697323 = idf(docFreq=2, maxDocs=262)
.....4.8828125E-4 = fieldNorm(doc=66)
..0.0016930922 = (MATCH) weight(title:big in 66) [DefaultSimilarity], result of:
...0.0016930922 = score(doc=66,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.005060596 = fieldWeight in 66, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....9.765625E-4 = fieldNorm(doc=66)
.0.42857143 = coord(3/7)

http://www-a4.staging.mcdonalds.com/us/en/food/food_quality/see_what_we_are_made_of/meet_our_suppliers/keystone_foods.html

8.465462E-4 = (MATCH) product of:
.0.005925823 = (MATCH) sum of:
..0.005925823 = (MATCH) weight(keywords:big in 28) [DefaultSimilarity], result of:
...0.005925823 = score(doc=28,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.017712086 = fieldWeight in 28, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0034179688 = fieldNorm(doc=28)
.0.14285715 = coord(1/7)


LuceneQParser

[debug timing values; element names lost in the archive]
4.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0

Thanks
Stuart






config.docx
Description: config.docx


schema.docx
Description: schema.docx


Why do documents without the search query term rank highest

2015-12-01 Thread Scotten Stuart
Hi All,

I hope this is the way to ask a question - please guide me if there is a 
different protocol

I have a question about results ranking for Solr V4.2 in combination with the 
CMS tool Adobe CQ (V5.6).

Despite trying different ways to configure the ranking of documents I am 
confused why content that does not have even one mention of the search query 
ranks higher than documents that are actually titled with the search query.

For example, searching for 'big' brings back 'Home' as the top result and 'Big 
Mac' as the second result - see here 
http://www-a4.staging.mcdonalds.com/us/en/search/search_results.html?search=simple=burger=usmcd


Any thoughts would be very welcome

Thanks
Stuart