Re: Boost Strangeness

2011-06-16 Thread Judioo
fascinating

Thank you so much Erick, I'm slowly beginning to understand.

So I've discovered that by defining 'splitOnNumerics=0' on the filter
class 'solr.WordDelimiterFilterFactory' (for ONLY the query analyzer) I
can get *closer* to my required goal!

Now something else odd is occurring.

It only returns 2 results where there are over 70?

Why is that? I can't find where this is explained :(

query

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output

{
  responseHeader: {
    status: 0
    QTime: 51
    params: {
      debugQuery: on
      fl: type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score
      indent: on
      q: b006m86d
      qf: id^10 parent_id^9 brand_container_id^8 series_container_id^8
          subseries_container_id^8 clip_container_id^1 clip_episode_id^1
      wt: json
      omitNorms: [ true, true ]
      defType: dismax
    }
  }
  response: {
    numFound: 2
    start: 0
    maxScore: 13.473297
    docs: [
      {
        parent_id:
        id: b006m86d
        type: brand
        score: 13.473297
      }
      {
        series_container_id:
        id: b00y1w9h
        type: episode
        brand_container_id: b006m86d
        subseries_container_id:
        clip_episode_id:
        score: 11.437143
      }
    ]
  }
  debug: {
    rawquerystring: b006m86d
    querystring: b006m86d
    parsedquery: +DisjunctionMaxQuery((id:b006m86d^10.0 |
        clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 |
        series_container_id:b006m86d^8.0 | clip_container_id:b006m86d |
        brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()
    parsedquery_toString: +(id:b006m86d^10.0 | clip_episode_id:b006m86d |
        subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 |
        clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 |
        parent_id:b006m86d^9.0) ()
    explain: {
      b006m86d:
        13.473297 = (MATCH) sum of:
          13.473297 = (MATCH) max of:
            13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of:
              1.0 = tf(termFreq(id:b006m86d)=1)
              13.473297 = idf(docFreq=2, maxDocs=783800)
              1.0 = fieldNorm(field=id, doc=27636)
      b00y1w9h:
        11.437143 = (MATCH) sum of:
          11.437143 = (MATCH) max of:
            11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of:
              0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of:
                8.0 = boost
                13.878762 = idf(docFreq=1, maxDocs=783800)
                0.007422088 = queryNorm
              13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of:
                1.0 = tf(termFreq(brand_container_id:b006m86d)=1)
                13.878762 = idf(docFreq=1, maxDocs=783800)
                1.0 = fieldNorm(field=brand_container_id, doc=61)
    }
    QParser: DisMaxQParser
    altquerystring: null
    boostfuncs: null
    timing: {
      time: 51
      prepare: {
        time: 6
        org.apache.solr.handler.component.QueryComponent: { time: 5 }
        org.apache.solr.handler.component.FacetComponent: { time: 0 }
        org.apache.solr.handler.component.MoreLikeThisComponent: { time: 0 }
        org.apache.solr.handler.component.HighlightComponent: { time: 1 }
        org.apache.solr.handler.component.StatsComponent: { time: 0 }
        org.apache.solr.handler.component.DebugComponent: { time: 0 }
      }
      process: {
        time: 45
        org.apache.solr.handler.component.QueryComponent: { time: 27 }
        org.apache.solr.handler.component.FacetComponent: { time: 0 }
        org.apache.solr.handler.component.MoreLikeThisComponent: { time: 0 }
        org.apache.solr.handler.component.HighlightComponent: { time: 0 }
        org.apache.solr.handler.component.StatsComponent: {

Re: Boost Strangeness

2011-06-16 Thread Erick Erickson
Right, if you've only changed WordDelimiterFilterFactory in the query
analyzer, then the tokens you're analyzing may be split up. Try running
some of the terms through the admin/analysis page. Unless you have
catenateAll=1 in the definition, the whole term won't be there.

It becomes a question of why you even want WDFF in there in the first
place, do you ever want to split these fields up this way? Maybe start
by just taking it out completely?
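
To picture the mismatch described above, here is a rough Python sketch. It is not Solr's implementation: `wdf_tokens` is a hypothetical helper that only imitates how WordDelimiterFilterFactory treats lowercase alphanumeric IDs, and it assumes the index analyzer still has splitOnNumerics enabled while the query analyzer has it turned off.

```python
import re


def wdf_tokens(term, split_on_numerics=True, catenate_all=False):
    """Very rough imitation of WordDelimiterFilter for lowercase IDs.

    Hypothetical helper for illustration only; the real filter has many
    more options and rules.
    """
    if split_on_numerics:
        # Split at every letter/digit transition.
        parts = re.findall(r"[a-z]+|[0-9]+", term)
    else:
        parts = [term]
    tokens = list(parts)
    if catenate_all and len(parts) > 1:
        # catenateAll also emits the rejoined whole term.
        tokens.append("".join(parts))
    return tokens


# Index analyzer left unchanged (splitting on), query analyzer changed.
index_tokens = wdf_tokens("b006m86d")
query_tokens = wdf_tokens("b006m86d", split_on_numerics=False)

print(index_tokens)  # ['b', '006', 'm', '86', 'd']
print(query_tokens)  # ['b006m86d']

# The whole-term query token is not among the indexed tokens...
print(set(query_tokens) <= set(index_tokens))  # False

# ...unless the index side also keeps the catenated term.
index_with_catenate = wdf_tokens("b006m86d", catenate_all=True)
print(set(query_tokens) <= set(index_with_catenate))  # True
```

Under these assumptions, a field indexed only as split parts cannot match the unsplit query term, which is one plausible reason for seeing far fewer hits than expected.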

Best
Erick

On Thu, Jun 16, 2011 at 9:55 AM, Judioo cont...@judioo.com wrote:
 fascinating

 Thank you so much Erick, I'm slowly beginning to understand.

 So I've discovered that by defining 'splitOnNumerics=0' on the filter
 class 'solr.WordDelimiterFilterFactory' (for ONLY the query analyzer) I
 can get *closer* to my required goal!

 Now something else odd is occurring.

 It only returns 2 results where there are over 70?

 Why is that? I can't find where this is explained :(

 query

 /solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true


Re: Boost Strangeness

2011-06-15 Thread Ahmet Arslan
 I have 2 document types but want to return any documents where the
 requested ID appears. The ID appears in multiple attributes but I want
 to boost results based on which attribute contains the ID.

 so my query is

 q=id:b007vty6 parent_id:b007vty6 brand_container_id:b007vty6
 series_container_id:b007vty6 subseries_container_id:b007vty6
 clip_container_id:b007vty6 clip_episode_id:b007vty6

 and I use qf to boost fields

 qf=id^10 parent_id^9 brand_container_id^8 series_container_id^8
 subseries_container_id^8 clip_container_id^1 clip_episode_id^1
 

There is a misunderstanding here. The qf parameter is specific to the
(e)dismax query parser plugin. For more information about it please see:

http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

Your query string can be something like this:

defType=dismax&q=b007vty6&qf=id^10 parent_id^9 brand_container_id^8 ...

It automatically expands your simple word query to multiple fields.
Setting defType=dismax is required to enable it, either in the URL or in
solrconfig.xml (defaults section).
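
As a minimal sketch of building such a request without hand-assembling the separators and escapes, the following Python uses only the standard library. The field names and boosts come from the original question; the `/solr/select` path, the shortened `fl` list, and the variable names are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Parameters for a dismax request; q is the bare ID, qf carries the boosts.
params = {
    "defType": "dismax",
    "q": "b007vty6",
    "qf": (
        "id^10 parent_id^9 brand_container_id^8 series_container_id^8 "
        "subseries_container_id^8 clip_container_id^1 clip_episode_id^1"
    ),
    "fl": "id,parent_id,score",
    "debugQuery": "on",
    "wt": "json",
}

# urlencode joins the pairs with '&' and escapes spaces, '^' and ','.
query_string = urlencode(params)
url = "/solr/select?" + query_string
print(url)

# Round trip: decoding the query string gives back the original values.
decoded = parse_qs(query_string)
print(decoded["qf"][0] == params["qf"])
```

Letting the library handle the encoding avoids the situation where the parameters run together into a single unreadable string.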


Re: Boost Strangeness

2011-06-15 Thread Judioo
Apologies
I have tried that method as well.

/solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on


Same result (just higher scores). It's almost as if partial matches on
brand|series_container_id and id are being considered in the 1st document.
Surely this can't be right / expected?

{
  responseHeader: {
    status: 0
    QTime: 13
    params: {
      debugQuery: on
      fl: id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score
      indent: on
      q: b007vty6
      qf: id^10 parent_id^9 brand_container_id^8 series_container_id^8
          subseries_container_id^8 clip_container_id^1 clip_episode_id^1
      wt: json
      defType: dismax
    }
  }
  response: {
    numFound: 2
    start: 0
    maxScore: 21.138214
    docs: [
      {
        series_container_id: b007vm94
        id: b007vsvm
        brand_container_id: b007hhk5
        subseries_container_id: b007vty6
        clip_episode_id:
        score: 21.138214
      }
      {
        parent_id: b007vm94
        id: b007vty6
        score: 5.1243143
      }
    ]
  }
  debug: {
    rawquerystring: b007vty6
    querystring: b007vty6
    parsedquery: +DisjunctionMaxQuery((id:b007vty6^10.0 |
        clip_episode_id:b 007 vty 6 | subseries_container_id:b 007 vty 6^8.0 |
        series_container_id:b 007 vty 6^8.0 | clip_container_id:b 007 vty 6 |
        brand_container_id:b 007 vty 6^8.0 | parent_id:b 007 vty 6^9.0)) ()
    parsedquery_toString: +(id:b007vty6^10.0 | clip_episode_id:b 007 vty 6 |
        subseries_container_id:b 007 vty 6^8.0 | series_container_id:b 007 vty 6^8.0 |
        clip_container_id:b 007 vty 6 | brand_container_id:b 007 vty 6^8.0 |
        parent_id:b 007 vty 6^9.0) ()
    explain: {
      b007vsvm:
        21.138214 = (MATCH) sum of:
          21.138214 = (MATCH) max of:
            21.138214 = (MATCH) weight(subseries_container_id:b 007 vty 6^8.0 in 39526), product of:
              0.85312855 = queryWeight(subseries_container_id:b 007 vty 6^8.0), product of:
                8.0 = boost
                49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87)
                0.0021519922 = queryNorm
              24.77729 = fieldWeight(subseries_container_id:b 007 vty 6 in 39526), product of:
                1.0 = tf(phraseFreq=1.0)
                49.55458 = idf(subseries_container_id: b=547 007=31 vty=1 6=87)
                0.5 = fieldNorm(field=subseries_container_id, doc=39526)
      b007vty6:
        5.1243143 = (MATCH) sum of:
          5.1243143 = (MATCH) max of:
            5.1243143 = (MATCH) weight(id:b007vty6^10.0 in 39512), product of:
              0.33207658 = queryWeight(id:b007vty6^10.0), product of:
                10.0 = boost
                15.431123 = idf(docFreq=1, maxDocs=3701577)
                0.0021519922 = queryNorm
              15.431123 = (MATCH) fieldWeight(id:b007vty6 in 39512), product of:
                1.0 = tf(termFreq(id:b007vty6)=1)
                15.431123 = idf(docFreq=1, maxDocs=3701577)
                1.0 = fieldNorm(field=id, doc=39512)
    }
    QParser: DisMaxQParser
    altquerystring: null
    boostfuncs: null
    timing: {
      time: 13
      prepare: {
        time: 3
        org.apache.solr.handler.component.QueryComponent: { time: 3 }
        org.apache.solr.handler.component.FacetComponent: { time: 0 }
        org.apache.solr.handler.component.MoreLikeThisComponent: { time: 0 }
        org.apache.solr.handler.component.HighlightComponent: { time: 0 }
        org.apache.solr.handler.component.StatsComponent: { time: 0 }
        org.apache.solr.handler.component.DebugComponent: { time: 0 }
      }
      process: {
        time: 10
        org.apache.solr.handler.component.QueryComponent: { time: 0 }
        org.apache.solr.handler.component.FacetComponent: { time: 0 }
        org.apache.solr.handler.component.MoreLikeThisComponent: { time: 0 }
        org.apache.solr.handler.component.HighlightComponent: { time: 0 }
        org.apache.solr.handler.component.StatsComponent: {
          time: 0

Re: Boost Strangeness

2011-06-15 Thread Ahmet Arslan
 /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on

 same result ( just higher scores ). It's almost as if partial matches on
 brand|series_container_id and id are being considered in the 1st document.
 Surely this can't be right / expected?

What is your fieldType definition? Don't you think it would be better to use
the string type, which is not tokenized?


Re: Boost Strangeness

2011-06-15 Thread Judioo
   <dynamicField name="*_id"  type="text"  indexed="true"  stored="true"/>

so all attributes except 'id' are of type text.

I didn't know that about the string type. So is my problem as described
(that partial matches are contributing to the calculation), and does defining
the field type as string solve this problem?

Or is my understanding completely incorrect?

Thanks in advance

On 15 June 2011 12:08, Ahmet Arslan iori...@yahoo.com wrote:

 
 /solr/select/?q=b007vty6&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on

  same result ( just higher scores ). It's almost as if partial matches on
  brand|series_container_id and id are being considered in the 1st document.
  Surely this can't be right / expected?

 What is your fieldType definition? Don't you think it is better to use
 string type which is not tokenized?



Re: Boost Strangeness

2011-06-15 Thread Judioo
String also does not seem to accept spaces. Currently the _id fields can
contain multiple ids (using as a multiType alternative). This is why I
used the text type.

On 15 June 2011 12:16, Judioo cont...@judioo.com wrote:

    <dynamicField name="*_id"  type="text"  indexed="true"  stored="true"/>

 so all attributes except 'id' are of type text.

 I didn't know that about the string type. So is my problem as described (
 that partial matches are contributing to the calculation ) and does defining
 the field type as string solve this problem?

 Or is my understanding completely incorrect?

 Thanks in advance







Re: Boost Strangeness

2011-06-15 Thread Erick Erickson
First off, you didn't violate group etiquette. In fact, yours was
one of the better first posts in terms of providing enough information
for us to actually help!

A very useful page is the admin/analysis page, which shows how the
analysis chain works. For instance, if you haven't changed the
field type (i.e. <fieldType name="text">), you'll see that your input is
being broken up by WordDelimiterFilterFactory. Be sure to check
the verbose checkbox and enter text in both the query and
index boxes!

Here's an invaluable page, though do note that it's not exhaustive:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


But on to your problem:

First, boosting isn't absolute. Boosting terms just tends to
bubble things up, so you have to experiment with various weights.

To get the full comparison for both documents you're curious about,
try using explainOther. See:
http://wiki.apache.org/solr/SolrRelevancyFAQ#Why_doesn.27t_document_id:juggernaut_appear_in_the_top_10_results_for_my_query

If you use that against the two docs in question, you should
see (although it's a hard read!) the reason the docs got
their relative scores.
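
The explain output posted earlier in the thread already lists the individual factors, so the two scores can be reproduced by hand. The sketch below assumes the breakdown shown there (queryWeight = boost × idf × queryNorm, fieldWeight = tf × idf × fieldNorm); `lucene_score` is a hypothetical helper name, and every number is copied from that output.

```python
def lucene_score(boost, idf, query_norm, tf, field_norm):
    """Score of one matching clause, following the explain breakdown."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


query_norm = 0.0021519922

# Document b007vsvm: phrase match "b 007 vty 6" in subseries_container_id.
subseries_score = lucene_score(
    boost=8.0, idf=49.55458, query_norm=query_norm, tf=1.0, field_norm=0.5
)

# Document b007vty6: exact term match in id.
id_score = lucene_score(
    boost=10.0, idf=15.431123, query_norm=query_norm, tf=1.0, field_norm=1.0
)

print(round(subseries_score, 3))  # about 21.138
print(round(id_score, 3))         # about 5.124
```

Because idf enters both factors, it is effectively squared. The phrase idf of 49.55 reported for the four split tokens therefore outweighs both the higher boost on id (10 versus 8) and the 0.5 fieldNorm on the subseries field.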

Finally, your next e-mail hints at what's happening. If you're
putting multiple tokens in some of these fields, the length
normalization may be causing the matches to score lower. You can
try disabling those calculations (omitNorms=true in your field definition).
See:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

String types accept spaces just fine, but you might want to define
the fields with multiValued="true" and index each ID as a separate
field value (note that this won't work with a field that's also your uniqueKey).
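
As an illustration of indexing each ID as its own value, the sketch below builds a Solr XML update message in Python with one `<field>` element per ID instead of one space-separated string. It assumes the `*_id` fields are declared as multiValued string fields in the schema; `build_add_doc` is a hypothetical helper, and the second ID `b007aaaa` is made up for the example.

```python
import xml.etree.ElementTree as ET


def build_add_doc(doc_id, id_fields):
    """Build an <add><doc> update message, one <field> per ID value."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, raw in id_fields.items():
        # Split the space-separated IDs so each becomes a separate value.
        for value in raw.split():
            ET.SubElement(doc, "field", name=name).text = value
    return add


add = build_add_doc(
    "b007vsvm",
    {"subseries_container_id": "b007vty6 b007aaaa"},  # second ID is made up
)
xml_text = ET.tostring(add, encoding="unicode")
print(xml_text)
```

With untokenized string values, a query for `b007vty6` would match only documents holding that exact ID in the field, rather than any document sharing some of its split parts.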

Best
Erick

On Wed, Jun 15, 2011 at 7:16 AM, Judioo cont...@judioo.com wrote:
   <dynamicField name="*_id"  type="text"  indexed="true"  stored="true"/>

 so all attributes except 'id' are of type text.

 I didn't know that about the string type. So is my problem as described (
 that partial matches are contributing to the calculation ) and does defining
 the field type as string solve this problem?

 Or is my understanding completely incorrect?

 Thanks in advance
