Re: per-fieldtype similarity not working

2013-03-29 Thread mike.vogel
Is there any example or suggestion for how to patch the wrapper so that the
coord method is called for the field type with the custom similarity?





RE: per-fieldtype similarity not working

2012-06-08 Thread Markus Jelsma
Thanks Robert,

The difference in scores is clear now, so it shouldn't matter, as queryNorm
doesn't affect ranking, but coord does. Can you explain why coord is left out
now, why it is considered to skew results, and why queryNorm skews results?
And which specific new ranking algorithms do they confuse, BM25F?

Also, I would expect the default SchemaSimilarityFactory to behave the same as
DefaultSimilarity; the difference might cause further confusion down the line.

I'll open an issue for the missing Similarity implementation name in the debug
output when per-field similarity is enabled.

Cheers!

 
 


Re: per-fieldtype similarity not working

2012-06-08 Thread Robert Muir
On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
> Thanks Robert,
>
> The difference in scores is clear now so it shouldn't matter as queryNorm
> doesn't affect ranking but coord does. Can you explain why coord is left out
> now and why it is considered to skew results and why queryNorm skews results?
> And which specific new ranking algorithms they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like this because its
function (sqrt) grows very fast for a single term.
On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates
rather quickly for a single term, so when multiple terms are being
scored, huge numbers of occurrences of a single term won't dominate
the overall score.

You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
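
The same comparison can be sketched in a few lines of Java. This is only an illustration of the two curves in the plot link above, under the same assumptions stated there (documentLength = averageDocumentLength, k = 1.2), so the BM25 term reduces to x/(x+1.2). The class and method names are made up for this example and are not part of Lucene.

```java
// Standalone illustration of the two TF normalization curves; no Lucene classes are used.
// Assumption: k = 1.2 and documentLength == averageDocumentLength, so the
// BM25-style term reduces to freq / (freq + k).
class TfSaturation {
    // sqrt(freq): keeps growing as a single term repeats
    static double sqrtTf(double freq) {
        return Math.sqrt(freq);
    }

    // freq / (freq + k): approaches 1 and never exceeds it
    static double saturatingTf(double freq, double k) {
        return freq / (freq + k);
    }

    public static void main(String[] args) {
        double k = 1.2;
        for (int freq : new int[] {1, 2, 10, 100}) {
            System.out.printf("freq=%3d  sqrt=%.4f  saturating=%.4f%n",
                    freq, sqrtTf(freq), saturatingTf(freq, k));
        }
    }
}
```

Going from 1 to 100 occurrences multiplies the sqrt value by 10, while the saturating value stays below 1, which is the point being made about a single term dominating the score.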


> Also, i would expect the default SchemaSimilarityFactory to behave the same
> as DefaultSimilarity this might raise some further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be
trickier than have a trap for people who try to use any algorithm
other than TFIDFSimilarity.

-- 
lucidimagination.com


RE: per-fieldtype similarity not working

2012-06-08 Thread Markus Jelsma
Excellent!
Thanks

 
 


RE: per-fieldtype similarity not working

2012-06-01 Thread Markus Jelsma
Thanks, but I am clearly missing something. We declare the similarity in the
fieldType just as in the example, and looking at the example again I don't see
how it's being done differently. What am I missing and where do I miss it? :)



Re: per-fieldtype similarity not working

2012-06-01 Thread Robert Muir
On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
> Thanks but I am clearly missing something? We declare the similarity in the
> fieldType just as in the example and looking at the example again I don't see
> how it's being done differently. What am I missing and where do I miss it? :)


Hi Markus, check out the last line at the bottom:
  <!-- default similarity, defers to the fieldType -->
  <similarity class="solr.SchemaSimilarityFactory"/>

When this is set, it means IndexSearcher/IndexWriter use a
PerFieldSimilarityWrapper that delegates to the similarity configured
on the Solr schema fieldType.

Note that this is just a simple, ordinary similarity implementation
(http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/similarities/SchemaSimilarityFactory.java);
you could also write your own that works differently.
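
The delegation idea described above can be modelled in plain Java. This is a simplified sketch and not the SchemaSimilarityFactory source: it uses no Lucene or Solr classes, it reduces a "similarity" to a single TF function, and the class name, the register method and the choice of fallback are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Standalone model of per-field delegation with a default fallback.
// Hypothetical names; a "similarity" is reduced to a TF function here.
class PerFieldTfModel {
    private final Map<String, DoubleUnaryOperator> perField = new HashMap<>();
    private final DoubleUnaryOperator fallback;

    PerFieldTfModel(DoubleUnaryOperator fallback) {
        this.fallback = fallback;
    }

    // Corresponds to declaring a similarity on one fieldType
    void register(String field, DoubleUnaryOperator tf) {
        perField.put(field, tf);
    }

    // Look up by field name; defer to the default when nothing is declared
    DoubleUnaryOperator get(String field) {
        return perField.getOrDefault(field, fallback);
    }

    public static void main(String[] args) {
        PerFieldTfModel model = new PerFieldTfModel(Math::sqrt);
        model.register("url", freq -> freq / (freq + 1.2));

        // "url" uses its own function, "title" falls back to sqrt
        System.out.println(model.get("url").applyAsDouble(10.0));
        System.out.println(model.get("title").applyAsDouble(10.0));
    }
}
```

Without the global wrapper there is nothing performing this lookup, which matches the symptom of the fieldType declaration alone having no effect.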

-- 
lucidimagination.com


RE: per-fieldtype similarity not working

2012-06-01 Thread Markus Jelsma
Hi!


Ah, it makes sense now! This globally configured similarity returns a
fieldType-defined similarity if one is available and, if not, the standard
Lucene similarity. This would, I assume, mean that the two similarities defined
below, without similarities declared per fieldType, would always yield the same
results?

<similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/>
<similarity class="solr.SchemaSimilarityFactory"/>

I would assume so because, without a per-fieldType declaration, the
SchemaSimilarityFactory returns the default Lucene Similarity. However, when
checking, it doesn't work for my url field but does work for the content and
title fields. I have defined the same similarity for the url fieldType as I did
for the title fieldType. This is the output for solr.SchemaSimilarityFactory
without per-field similarities declared:

  38.565483 = (MATCH) max plus 0.27 times others of:
    5.434552 = (MATCH) weight(content:groning^1.4 in 384) [], result of:
      5.434552 = score(doc=384,freq=10.0 = termFreq=10.0), product of:
        1.5511217 = queryWeight, product of:
          1.4 = boost
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          1.0 = queryNorm
        3.503627 = fieldWeight in 384, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          1.0 = fieldNorm(doc=384)
    4.38 = (MATCH) weight(title:groning^4.7 in 384) [], result of:
      4.38 = score(doc=384,freq=2.0 = termFreq=2.0), product of:
        5.346149 = queryWeight, product of:
          4.7 = boost
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          1.0 = queryNorm
        0.8043188 = fieldWeight in 384, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          0.5 = fieldNorm(doc=384)
    35.937153 = (MATCH) weight(url:groning^2.1 in 384) [], result of:
      35.937153 = score(doc=384,freq=1.0 = termFreq=1.0), product of:
        10.988577 = queryWeight, product of:
          2.1 = boost
          5.232656 = idf(docFreq=19, maxDocs=1378)
          1.0 = queryNorm
        3.27041 = fieldWeight in 384, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.232656 = idf(docFreq=19, maxDocs=1378)
          0.625 = fieldNorm(doc=384)


Here's the output with DefaultSimilarity declared:

  3.2723136 = (MATCH) max plus 0.27 times others of:
    0.46112633 = (MATCH) weight(content:groning^1.4 in 327) [DefaultSimilarity], result of:
      0.46112633 = score(doc=327,freq=10.0 = termFreq=10.0), product of:
        0.13161398 = queryWeight, product of:
          1.4 = boost
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          0.08485084 = queryNorm
        3.503627 = fieldWeight in 327, product of:
          3.1622777 = tf(freq=10.0), with freq of:
            10.0 = termFreq=10.0
          1.1079441 = idf(docFreq=1236, maxDocs=1378)
          1.0 = fieldNorm(doc=327)
    0.36485928 = (MATCH) weight(title:groning^4.7 in 327) [DefaultSimilarity], result of:
      0.36485928 = score(doc=327,freq=2.0 = termFreq=2.0), product of:
        0.45362523 = queryWeight, product of:
          4.7 = boost
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          0.08485084 = queryNorm
        0.8043188 = fieldWeight in 327, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          1.1374786 = idf(docFreq=1200, maxDocs=1378)
          0.5 = fieldNorm(doc=327)
    3.0492976 = (MATCH) weight(url:groning^2.1 in 327) [DefaultSimilarity], result of:
      3.0492976 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
        0.93239 = queryWeight, product of:
          2.1 = boost
          5.232656 = idf(docFreq=19, maxDocs=1378)
          0.08485084 = queryNorm
        3.27041 = fieldWeight in 327, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.232656 = idf(docFreq=19, maxDocs=1378)
          0.625 = fieldNorm(doc=327)

How can I explain the difference? Also, with the factory declared, the score of
the url field is still the same; it does not seem to honour the similarity
declared per field. It also seems the debug output is wrong: it does not write
the similarity class name between [] and produces an empty [] for each match.
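
As a cross-check of the two explain outputs above, the content clause can be recomputed by hand. The sketch below is not Lucene code; it only multiplies the factors exactly as the explain tree lists them, with every constant copied from that output. The class and method names are made up for this example.

```java
// Recomputes the content:groning clause from the explain output.
// Standalone arithmetic only; names are illustrative, constants are copied from the output.
class ExplainCheck {
    // score = queryWeight * fieldWeight, following the explain tree
    static double clauseScore(double boost, double idf, double queryNorm,
                              double termFreq, double fieldNorm) {
        double queryWeight = boost * idf * queryNorm;
        double tf = Math.sqrt(termFreq);            // tf(freq=10.0) = 3.1622777
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // "max plus tie times others", as in the top line of the explain output
    // (clause scores are assumed to be non-negative)
    static double disMax(double tie, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        double withFactory = clauseScore(1.4, 1.1079441, 1.0, 10.0, 1.0);
        double withDefault = clauseScore(1.4, 1.1079441, 0.08485084, 10.0, 1.0);
        System.out.printf("SchemaSimilarityFactory: %.6f%n", withFactory);
        System.out.printf("DefaultSimilarity:       %.6f%n", withDefault);
        System.out.printf("ratio:                   %.8f%n", withDefault / withFactory);
        System.out.printf("dismax (DefaultSimilarity run): %.6f%n",
                disMax(0.27, 0.46112633, 0.36485928, 3.0492976));
    }
}
```

The ratio between the two clause scores is the queryNorm value itself, so in these outputs queryNorm (1.0 versus 0.08485084) is the only factor that differs; boost, idf, tf and fieldNorm are identical in both runs.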

Many thanks and a nice weekend!
Markus
 
 

Re: per-fieldtype similarity not working

2012-06-01 Thread Robert Muir
On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
> Hi!
>
> Ah, it makes sense now! This global configured similarity returns a
> fieldType defined similarity if available and if not the standard Lucene
> similarity. This would, I assume, mean that the two defined similarities
> below without per fieldType declared similarities would always yield the same
> results?

Not true: note that two methods (coord and queryNorm) are not per-field
but global across the entire query tree.

By default these are disabled in the wrapper, as they only skew or
confuse most modern scoring algorithms (e.g. all the new ranking
algorithms in Lucene 4) respectively.

So if you want to do per-field scoring where *all* of your sims are
vector-space, it could make sense to customize (e.g. subclass)
SchemaSimilarityFactory and do something useful for these methods.
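
For reference, here is a sketch of what the two query-level factors compute in the classic TF-IDF similarity: coord is overlap / maxOverlap and queryNorm is 1 / sqrt(sumOfSquaredWeights). It is a standalone illustration that does not use or subclass any Lucene class, and the raw scores and the sum of squared weights are hypothetical numbers chosen to show the effect on ordering.

```java
// Standalone illustration of coord and queryNorm; no Lucene classes are used.
// The raw scores and the sum of squared weights below are hypothetical.
class QueryLevelFactors {
    // fraction of query clauses that the document matched
    static double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }

    // one constant per query, identical for every document
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        double rawA = 5.0; int matchedA = 1;   // matches one of two clauses, strongly
        double rawB = 4.0; int matchedB = 2;   // matches both clauses
        double norm = queryNorm(16.0);         // 0.25

        // queryNorm scales both documents equally: A stays ahead of B
        System.out.println(rawA * norm > rawB * norm);
        // coord differs per document: B overtakes A
        System.out.println(rawA * coord(matchedA, 2) > rawB * coord(matchedB, 2));
    }
}
```

This prints true and then false: a uniform queryNorm cannot change the ordering, whereas coord can, so a customized factory that re-enables coord changes ranking while one that only re-enables queryNorm changes score magnitudes.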


-- 
lucidimagination.com


per-fieldtype similarity not working

2012-05-31 Thread Markus Jelsma
Hi,

We intend to use different similarity implementations for some field types,
configured according to SOLR-2338. I double-checked against the schema in
test-files and everything seems fine. However, the result is not correct and
debugQuery shows that the default configured similarity implementation is being
used.

We simply declare the following in our fieldType:
<similarity class="FQCN"/>


Thanks,
Markus


Re: per-fieldtype similarity not working

2012-05-31 Thread Robert Muir
On Thu, May 31, 2012 at 11:23 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

> We simply declare the following in our fieldType:
> <similarity class="FQCN"/>


That's not enough; see the example:
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml


-- 
lucidimagination.com