Re: per-fieldtype similarity not working
Any example or suggestion for how to patch the wrapper so that coord method is called for the field type with the custom similarity?
RE: per-fieldtype similarity not working
Thanks Robert,

The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse, BM25F?

Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might raise some further confusion down the line.

I'll open an issue for the lack of Similarity impl. in the debug output when per-field similarity is enabled.

Cheers!

-Original message-
From: Robert Muir rcm...@gmail.com
Sent: Fri 01-Jun-2012 18:16
To: solr-user@lucene.apache.org
Subject: Re: per-fieldtype similarity not working

On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> Hi! Ah, it makes sense now! This globally configured similarity returns a fieldType-defined similarity if available and, if not, the standard Lucene similarity. This would, I assume, mean that the two defined similarities below, without per-fieldType declared similarities, would always yield the same results?

Not true: note that two methods (coord and queryNorm) are not per-field but global across the entire query tree. By default these are disabled in the wrapper, as they only skew or confuse most modern scoring algorithms (e.g. all the new ranking algorithms in Lucene 4) respectively. So if you want to do per-field scoring where *all* of your sims are vector-space, it could make sense to customize (e.g. subclass) SchemaSimilarityFactory and do something useful for these methods.

-- lucidimagination.com
Re: per-fieldtype similarity not working
On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> Thanks Robert, The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions. DefaultSimilarity really needs something like coord because its function (sqrt) grows very fast for a single term. On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates rather quickly for a single term, so when multiple terms are being scored, huge numbers of occurrences of a single term won't dominate the overall score.

You can see this visually here (give it a second to load, and imagine documentLength = averageDocumentLength and k=1.2): http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100

> Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might raise some further confusion down the line.

That's ok: I'd rather the very expert case (per-field scoring) be trickier than have a trap for people that try to use any algorithm other than TFIDFSimilarity.

-- lucidimagination.com
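The saturation argument above can be checked numerically. The following is only a minimal sketch, not Lucene code: the class and method names are hypothetical, and the BM25-style curve is the simplified x/(x+1.2) form from the linked plot, which assumes documentLength equals averageDocumentLength and k=1.2.

```java
// Toy comparison of the two term-frequency curves discussed above.
// Hypothetical names; this does not call into Lucene.
final class TfSaturation {

    // Classic TF-IDF style factor: square root of the raw term count.
    static double sqrtTf(double freq) {
        return Math.sqrt(freq);
    }

    // Simplified BM25-style factor tf / (tf + k). Assumes the document has
    // average length, so the length-dependent term reduces to the constant k.
    static double saturatingTf(double freq, double k) {
        return freq / (freq + k);
    }

    public static void main(String[] args) {
        double k = 1.2; // value suggested in the message
        for (int freq : new int[] {1, 2, 10, 100}) {
            System.out.printf("freq=%d sqrt=%.4f saturating=%.4f%n",
                    freq, sqrtTf(freq), saturatingTf(freq, k));
        }
    }
}
```

The square-root curve keeps growing without bound, while the saturating curve stays below 1 however often the term occurs, which is why a single heavily repeated term cannot dominate a multi-term score under the second form.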
RE: per-fieldtype similarity not working
Excellent! Thanks
RE: per-fieldtype similarity not working
Thanks, but I am clearly missing something? We declare the similarity in the fieldType just as in the example, and looking at the example again I don't see how it's being done differently. What am I missing and where do I miss it? :)

-Original message-
From: Robert Muir rcm...@gmail.com
Sent: Thu 31-May-2012 17:47
To: solr-user@lucene.apache.org
Subject: Re: per-fieldtype similarity not working

On Thu, May 31, 2012 at 11:23 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> We simply declare the following in our fieldType: <similarity class="FQCN"/>

That's not enough, see the example: http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml

-- lucidimagination.com
Re: per-fieldtype similarity not working
On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> Thanks, but I am clearly missing something? We declare the similarity in the fieldType just as in the example, and looking at the example again I don't see how it's being done differently. What am I missing and where do I miss it? :)

Hi Markus, check out the last line at the bottom:

<!-- default similarity, defers to the fieldType -->
<similarity class="solr.SchemaSimilarityFactory"/>

When this is set, it means IndexSearcher/IndexWriter use a PerFieldSimilarityWrapper that delegates based on the Solr schema fieldType. Note this is just a simple, ordinary similarity impl (http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/similarities/SchemaSimilarityFactory.java); you could also write your own that works differently.

-- lucidimagination.com
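The delegation described above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: every name in it is hypothetical, a "similarity" is reduced to a single term-frequency function, and nothing here is the actual PerFieldSimilarityWrapper or SchemaSimilarityFactory API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Toy model of per-field delegation. All names are hypothetical; this is
// not the Lucene PerFieldSimilarityWrapper or Solr SchemaSimilarityFactory.
final class ToyPerFieldWrapper {
    private final Map<String, DoubleUnaryOperator> byFieldType = new HashMap<>();
    private final DoubleUnaryOperator defaultTf;

    ToyPerFieldWrapper(DoubleUnaryOperator defaultTf) {
        this.defaultTf = defaultTf;
    }

    // Stands in for declaring a similarity inside a fieldType.
    void declare(String fieldType, DoubleUnaryOperator tf) {
        byFieldType.put(fieldType, tf);
    }

    // Use the fieldType's own function if one was declared, else the default.
    double tf(String fieldType, double freq) {
        return byFieldType.getOrDefault(fieldType, defaultTf).applyAsDouble(freq);
    }

    public static void main(String[] args) {
        ToyPerFieldWrapper wrapper = new ToyPerFieldWrapper(Math::sqrt);
        // Hypothetical custom rule: term presence only, frequency ignored.
        wrapper.declare("url", freq -> freq > 0 ? 1.0 : 0.0);
        System.out.println(wrapper.tf("url", 10.0));   // uses the declared rule
        System.out.println(wrapper.tf("title", 4.0));  // falls back to sqrt
    }
}
```

The point of the model is that the per-fieldType declaration only takes effect when something at the global level performs this lookup; without the wrapper, every field goes through the single global similarity.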
RE: per-fieldtype similarity not working
Hi!

Ah, it makes sense now! This globally configured similarity returns a fieldType-defined similarity if available and, if not, the standard Lucene similarity. This would, I assume, mean that the two defined similarities below, without per-fieldType declared similarities, would always yield the same results?

<similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/>
<similarity class="solr.SchemaSimilarityFactory"/>

I would assume so because, without a per-fieldType declaration, the SchemaSimilarityFactory returns the default Lucene Similarity. However, when checking it out, it doesn't work for my url field but does work for the content and title fields. I have defined the same similarity for the url fieldType as I did for the title fieldType.

This is the output for solr.SchemaSimilarityFactory without per-field declared:

38.565483 = (MATCH) max plus 0.27 times others of:
  5.434552 = (MATCH) weight(content:groning^1.4 in 384) [], result of:
    5.434552 = score(doc=384,freq=10.0 = termFreq=10.0 ), product of:
      1.5511217 = queryWeight, product of:
        1.4 = boost
        1.1079441 = idf(docFreq=1236, maxDocs=1378)
        1.0 = queryNorm
      3.503627 = fieldWeight in 384, product of:
        3.1622777 = tf(freq=10.0), with freq of:
          10.0 = termFreq=10.0
        1.1079441 = idf(docFreq=1236, maxDocs=1378)
        1.0 = fieldNorm(doc=384)
  4.38 = (MATCH) weight(title:groning^4.7 in 384) [], result of:
    4.38 = score(doc=384,freq=2.0 = termFreq=2.0 ), product of:
      5.346149 = queryWeight, product of:
        4.7 = boost
        1.1374786 = idf(docFreq=1200, maxDocs=1378)
        1.0 = queryNorm
      0.8043188 = fieldWeight in 384, product of:
        1.4142135 = tf(freq=2.0), with freq of:
          2.0 = termFreq=2.0
        1.1374786 = idf(docFreq=1200, maxDocs=1378)
        0.5 = fieldNorm(doc=384)
  35.937153 = (MATCH) weight(url:groning^2.1 in 384) [], result of:
    35.937153 = score(doc=384,freq=1.0 = termFreq=1.0 ), product of:
      10.988577 = queryWeight, product of:
        2.1 = boost
        5.232656 = idf(docFreq=19, maxDocs=1378)
        1.0 = queryNorm
      3.27041 = fieldWeight in 384, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.232656 = idf(docFreq=19, maxDocs=1378)
        0.625 = fieldNorm(doc=384)

Here's the output with DefaultSimilarity declared:

3.2723136 = (MATCH) max plus 0.27 times others of:
  0.46112633 = (MATCH) weight(content:groning^1.4 in 327) [DefaultSimilarity], result of:
    0.46112633 = score(doc=327,freq=10.0 = termFreq=10.0 ), product of:
      0.13161398 = queryWeight, product of:
        1.4 = boost
        1.1079441 = idf(docFreq=1236, maxDocs=1378)
        0.08485084 = queryNorm
      3.503627 = fieldWeight in 327, product of:
        3.1622777 = tf(freq=10.0), with freq of:
          10.0 = termFreq=10.0
        1.1079441 = idf(docFreq=1236, maxDocs=1378)
        1.0 = fieldNorm(doc=327)
  0.36485928 = (MATCH) weight(title:groning^4.7 in 327) [DefaultSimilarity], result of:
    0.36485928 = score(doc=327,freq=2.0 = termFreq=2.0 ), product of:
      0.45362523 = queryWeight, product of:
        4.7 = boost
        1.1374786 = idf(docFreq=1200, maxDocs=1378)
        0.08485084 = queryNorm
      0.8043188 = fieldWeight in 327, product of:
        1.4142135 = tf(freq=2.0), with freq of:
          2.0 = termFreq=2.0
        1.1374786 = idf(docFreq=1200, maxDocs=1378)
        0.5 = fieldNorm(doc=327)
  3.0492976 = (MATCH) weight(url:groning^2.1 in 327) [DefaultSimilarity], result of:
    3.0492976 = score(doc=327,freq=1.0 = termFreq=1.0 ), product of:
      0.93239 = queryWeight, product of:
        2.1 = boost
        5.232656 = idf(docFreq=19, maxDocs=1378)
        0.08485084 = queryNorm
      3.27041 = fieldWeight in 327, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.232656 = idf(docFreq=19, maxDocs=1378)
        0.625 = fieldNorm(doc=327)

How can I explain the difference? Also, with the factory declared, the score of the url field is still the same; it does not seem to listen to the per-field declared similarity. It also seems the debug output is wrong: it does not write the similarity classname between [] and produces an empty [] for each match.
Many thanks and a nice weekend!
Markus

-Original message-
From: Robert Muir rcm...@gmail.com
Sent: Fri 01-Jun-2012 17:00
To: solr-user@lucene.apache.org
Subject: Re: per-fieldtype similarity not working

On Fri, Jun 1, 2012 at 5:13 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> Thanks but i am clearly missing something? We declare the similarity in the fieldType just
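The two explain outputs in this message can be cross-checked by redoing their arithmetic. The sketch below is not Lucene code and its names are hypothetical; it only multiplies the factors exactly as the explain tree lists them (queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and "max plus 0.27 times others" at the top).

```java
// Re-computes clause scores from the factors printed in the explain output.
// Hypothetical helper names; this does not call into Lucene or Solr.
final class ExplainCheck {

    // score = queryWeight * fieldWeight, following the explain tree.
    static double clauseScore(double boost, double idf, double queryNorm,
                              double tf, double fieldNorm) {
        double queryWeight = boost * idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // "max plus <tie> times others": best clause plus a fraction of the rest.
    static double disMax(double tie, double... clauses) {
        if (clauses.length == 0) {
            return 0.0;
        }
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double c : clauses) {
            max = Math.max(max, c);
            sum += c;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        double queryNorm = 0.08485084; // value from the DefaultSimilarity output
        double urlPlain = clauseScore(2.1, 5.232656, 1.0, 1.0, 0.625);
        double urlNormed = clauseScore(2.1, 5.232656, queryNorm, 1.0, 0.625);
        System.out.println("url, queryNorm=1.0: " + urlPlain);
        System.out.println("url, queryNorm=0.08485084: " + urlNormed);
        System.out.println("ratio: " + (urlNormed / urlPlain));
    }
}
```

Under these formulas, each clause in the DefaultSimilarity output is the corresponding SchemaSimilarityFactory clause multiplied by the same queryNorm of 0.08485084, so the two outputs differ by a constant factor only. The title clause's listed factors multiply to roughly 4.30 rather than the 4.38 shown, and roughly 4.30 is also what the top-level 38.565483 implies.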
Re: per-fieldtype similarity not working
On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> Hi! Ah, it makes sense now! This globally configured similarity returns a fieldType-defined similarity if available and, if not, the standard Lucene similarity. This would, I assume, mean that the two defined similarities below, without per-fieldType declared similarities, would always yield the same results?

Not true: note that two methods (coord and queryNorm) are not per-field but global across the entire query tree. By default these are disabled in the wrapper, as they only skew or confuse most modern scoring algorithms (e.g. all the new ranking algorithms in Lucene 4) respectively. So if you want to do per-field scoring where *all* of your sims are vector-space, it could make sense to customize (e.g. subclass) SchemaSimilarityFactory and do something useful for these methods.

-- lucidimagination.com
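To make the difference between the two global factors concrete, here is a toy sketch with made-up numbers. It assumes the DefaultSimilarity-style definition of coord as overlap / maxOverlap and treats queryNorm as a single constant applied to every document; the class name, the raw scores and the clause counts are all hypothetical.

```java
// Toy illustration of why queryNorm cannot change ranking but coord can.
// Hypothetical names and numbers; this does not call into Lucene.
final class GlobalFactors {

    // Fraction of the query's clauses that the document matched.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // queryNorm is one constant per query, applied to every document alike.
    static double withQueryNorm(double rawScore, double queryNorm) {
        return rawScore * queryNorm;
    }

    // coord depends on the document, so it can reorder results.
    static double withCoord(double rawScore, int overlap, int maxOverlap) {
        return rawScore * coord(overlap, maxOverlap);
    }

    public static void main(String[] args) {
        double rawA = 6.0; // hypothetical doc A: matches 1 of 3 clauses
        double rawB = 5.0; // hypothetical doc B: matches 3 of 3 clauses
        double queryNorm = 0.08485084;

        System.out.println("queryNorm: A=" + withQueryNorm(rawA, queryNorm)
                + " B=" + withQueryNorm(rawB, queryNorm));
        System.out.println("coord:     A=" + withCoord(rawA, 1, 3)
                + " B=" + withCoord(rawB, 3, 3));
    }
}
```

With queryNorm, A still outranks B because both scores shrink by the same factor. With coord, B overtakes A because A is penalized for matching only one clause. That per-document effect is the reason enabling coord in a customized wrapper changes result order, while queryNorm only changes the absolute score values.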
per-fieldtype similarity not working
Hi,

We intend to use different similarity implementations for some field types, configured according to SOLR-2338. I double-checked with the schema in test-files and everything seems fine. However, the result is not correct, and debugQuery shows the default configured similarity implementation is being used. We simply declare the following in our fieldType: <similarity class="FQCN"/>

Thanks,
Markus
Re: per-fieldtype similarity not working
On Thu, May 31, 2012 at 11:23 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> We simply declare the following in our fieldType: <similarity class="FQCN"/>

That's not enough, see the example: http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema-sim.xml

-- lucidimagination.com