You don't need to rebuild the index. The idf() override is applied only at query time.
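
To make the query-time point concrete: idf is derived from docFreq and maxDocs, and both are read from the index statistics when the query runs, so swapping the formula does not touch anything stored. Below is a minimal sketch in plain Java (no Lucene on the classpath; the class and method names are made up for illustration). It assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)), which reproduces the idf values in the explain output quoted further down.

```java
// Sketch only: plain Java, no Lucene dependency. Class and method names are hypothetical.
final class IdfSketch {

    // Classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)).
    static float defaultIdf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // What the NoIDFSimilarity override returns for every term.
    static float neutralIdf(long docFreq, long numDocs) {
        return 1.0f;
    }

    public static void main(String[] args) {
        long maxDocs = 11282414L; // maxDocs from the explain output in this thread

        // docFreq values are also taken from the explain output
        System.out.println("contributions:news   idf = " + defaultIdf(14, maxDocs));
        System.out.println("pg_series_title:news idf = " + defaultIdf(164961, maxDocs));
        System.out.println("with the override    idf = " + neutralIdf(14, maxDocs));
    }
}
```

Running it should print roughly 14.53 and 5.23 for the two fields, in line with the explain output, and 1.0 with the override. The fieldNorm values are a different matter: norms are written when documents are indexed, so a similarity that changes how norms are computed would need a reindex, while an idf-only override does not.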

On Fri, Feb 1, 2013 at 5:00 PM, Sandeep Mestry <sanmes...@gmail.com> wrote:

> Hi..
>
> Could you tell me if changing the default similarity to a custom implementation
> will require me to rebuild the index? Or will it be used only at query time?
>
> thanks,
> Sandeep
> On 31 Jan 2013 13:55, "Felipe Lahti" <fla...@thoughtworks.com> wrote:
>
> > So, it depends on your business requirement, right? If a document has
> > matches in more searchable fields, at least for me, this document is more
> > important than another document that has fewer matches.
> >
> > Example:
> > Put this in your schema:
> > <similarity class="com.your.namespace.NoIDFSimilarity" />
> >
> > And create a class in the classpath of your Solr:
> >
> > package com.your.namespace;
> >
> > import org.apache.lucene.search.similarities.DefaultSimilarity;
> >
> > public class NoIDFSimilarity extends DefaultSimilarity {
> >     // Constant idf: term rarity no longer affects the score.
> >     @Override
> >     public float idf(long docFreq, long numDocs) {
> >         return 1;
> >     }
> > }
> >
> > It will "neutralize" the idf (which is the rarity of the term).
> >
> > On Thu, Jan 31, 2013 at 5:31 AM, Sandeep Mestry <sanmes...@gmail.com>
> > wrote:
> >
> > > Thanks Felipe..
> > > Can you point me to an example please?
> > >
> > > Also forgive me but if a document has matches in more searchable fields
> > > then should it not rank higher?
> > >
> > > Thanks,
> > > Sandeep
> > > On 30 Jan 2013 19:30, "Felipe Lahti" <fla...@thoughtworks.com> wrote:
> > >
> > > > If you compare the first and last document scores you will see that the
> > > > last one matches more fields than the first one. So you may be wondering
> > > > why. The first doc only matches the "contributions" field and the last
> > > > matches a bunch of fields, so if you want it to behave more like (<str
> > > > name="qf">series_title^500 title^100 description^15 contribution</str>)
> > > > you have to override the method of DefaultSimilarity.
> > > >
> > > > On Wed, Jan 30, 2013 at 4:12 PM, Sandeep Mestry <sanmes...@gmail.com>
> > > > wrote:
> > > >
> > > > > I have pasted it below and it is slightly different from the dismax
> > > > > configuration I have mentioned above as I was playing with all sorts of
> > > > > boost values, however it looks more like below:
> > > > >
> > > > > <str name="c208c2ca-4270-27b8-e040-a8c00409063a">
> > > > > 2675.7844 = (MATCH) sum of:
> > > > >   2675.7844 = (MATCH) max plus 0.01 times others of:
> > > > >     2675.7844 = (MATCH) weight(contributions:news in 63298) [DefaultSimilarity], result of:
> > > > >       2675.7844 = score(doc=63298,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.004495774 = queryWeight, product of:
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         595177.7 = fieldWeight in 63298, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           40960.0 = fieldNorm(doc=63298)
> > > > > </str>
> > > > > <str name="c208c2a9-66bc-27b8-e040-a8c00409063a">
> > > > > 2317.297 = (MATCH) sum of:
> > > > >   2317.297 = (MATCH) max plus 0.01 times others of:
> > > > >     2317.297 = (MATCH) weight(contributions:news in 9826415) [DefaultSimilarity], result of:
> > > > >       2317.297 = score(doc=9826415,freq=3.0 = termFreq=3.0 ), product of:
> > > > >         0.004495774 = queryWeight, product of:
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         515439.0 = fieldWeight in 9826415, product of:
> > > > >           1.7320508 = tf(freq=3.0), with freq of:
> > > > >             3.0 = termFreq=3.0
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           20480.0 = fieldNorm(doc=9826415)
> > > > > </str>
> > > > > <str name="c208c2aa-1806-27b8-e040-a8c00409063a">
> > > > > 2140.6274 = (MATCH) sum of:
> > > > >   2140.6274 = (MATCH) max plus 0.01 times others of:
> > > > >     2140.6274 = (MATCH) weight(contributions:news in 9882325) [DefaultSimilarity], result of:
> > > > >       2140.6274 = score(doc=9882325,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.004495774 = queryWeight, product of:
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         476142.16 = fieldWeight in 9882325, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           32768.0 = fieldNorm(doc=9882325)
> > > > > </str>
> > > > > <str name="c208c2b0-5165-27b8-e040-a8c00409063a">
> > > > > 1605.4707 = (MATCH) sum of:
> > > > >   1605.4707 = (MATCH) max plus 0.01 times others of:
> > > > >     1605.4707 = (MATCH) weight(contributions:news in 220007) [DefaultSimilarity], result of:
> > > > >       1605.4707 = score(doc=220007,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.004495774 = queryWeight, product of:
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         357106.62 = fieldWeight in 220007, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           24576.0 = fieldNorm(doc=220007)
> > > > > </str>
> > > > > <str name="c208c2cc-d01b-27b8-e040-a8c00409063a">
> > > > > 1605.4707 = (MATCH) sum of:
> > > > >   1605.4707 = (MATCH) max plus 0.01 times others of:
> > > > >     1605.4707 = (MATCH) weight(contributions:news in 241151) [DefaultSimilarity], result of:
> > > > >       1605.4707 = score(doc=241151,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.004495774 = queryWeight, product of:
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         357106.62 = fieldWeight in 241151, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           14.530705 = idf(docFreq=14, maxDocs=11282414)
> > > > >           24576.0 = fieldNorm(doc=241151)
> > > > > </str>
> > > > > </lst>
> > > > > <str name="otherQuery">id:c208c2b4-1b3e-27b8-e040-a8c00409063a</str>
> > > > > <lst name="explainOther">
> > > > > <str name="c208c2b4-1b3e-27b8-e040-a8c00409063a"> <!-- this should rank higher -->
> > > > > 6.5742764 = (MATCH) sum of:
> > > > >   6.5742764 = (MATCH) max plus 0.01 times others of:
> > > > >     3.304414 = (MATCH) weight(description:news^25.0 in 967895) [DefaultSimilarity], result of:
> > > > >       3.304414 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.042727955 = queryWeight, product of:
> > > > >           25.0 = boost
> > > > >           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         77.33611 = fieldWeight in 967895, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           5.5240083 = idf(docFreq=122362, maxDocs=11282414)
> > > > >           14.0 = fieldNorm(doc=967895)
> > > > >     5.913381 = (MATCH) weight(pg_series_title:news^50.0 in 967895) [DefaultSimilarity], result of:
> > > > >       5.913381 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.080834694 = queryWeight, product of:
> > > > >           50.0 = boost
> > > > >           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         73.154 = fieldWeight in 967895, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           5.2252855 = idf(docFreq=164961, maxDocs=11282414)
> > > > >           14.0 = fieldNorm(doc=967895)
> > > > >     0.18680073 = (MATCH) weight(p_programme_title:news in 967895) [DefaultSimilarity], result of:
> > > > >       0.18680073 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.002031815 = queryWeight, product of:
> > > > >           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         91.93787 = fieldWeight in 967895, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           6.5669904 = idf(docFreq=43120, maxDocs=11282414)
> > > > >           14.0 = fieldNorm(doc=967895)
> > > > >     6.464123 = (MATCH) weight(pg_series_title_ci:news^500.0 in 967895) [DefaultSimilarity], result of:
> > > > >       6.464123 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.99999696 = queryWeight, product of:
> > > > >           500.0 = boost
> > > > >           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         6.4641423 = fieldWeight in 967895, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           6.4641423 = idf(docFreq=47791, maxDocs=11282414)
> > > > >           1.0 = fieldNorm(doc=967895)
> > > > >     1.6107484 = (MATCH) weight(title_ci:news^100.0 in 967895) [DefaultSimilarity], result of:
> > > > >       1.6107484 = score(doc=967895,freq=1.0 = termFreq=1.0 ), product of:
> > > > >         0.22324038 = queryWeight, product of:
> > > > >           100.0 = boost
> > > > >           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
> > > > >           3.093982E-4 = queryNorm
> > > > >         7.2153096 = fieldWeight in 967895, product of:
> > > > >           1.0 = tf(freq=1.0), with freq of:
> > > > >             1.0 = termFreq=1.0
> > > > >           7.2153096 = idf(docFreq=22548, maxDocs=11282414)
> > > > >           1.0 = fieldNorm(doc=967895)
> > > > > </str>
> > > > >
> > > > > On 30 January 2013 17:55, Felipe Lahti <fla...@thoughtworks.com> wrote:
> > > > >
> > > > > > Let me see if I understood your problem:
> > > > > >
> > > > > > By your first e-mail I think you are worried about the returned order of
> > > > > > documents from Solr. Is that correct? If yes, as I said before, it's not
> > > > > > only the boosting that influences the order of returned documents. There's
> > > > > > term frequency, IDF (inverse document frequency)... If I understood
> > > > > > correctly by your first e-mail, you are interested in getting rid of IDF.
> > > > > > So for that, you can create a NoIDFSimilarity class to override the
> > > > > > default similarity.
> > > > > >
> > > > > > Can you paste here the score calculation for one document?
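
As a cross-check on the explain output quoted above: each clause score there is queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm, where tf = sqrt(freq). The sketch below is plain Java with hypothetical class and method names and no Lucene dependency. It redoes that arithmetic for the top contributions hit and for the pg_series_title_ci clause of the document that should rank higher, then repeats it with idf forced to 1.

```java
// Sketch only: plain Java, no Lucene dependency. Names are hypothetical.
final class ExplainSketch {

    // Clause score as laid out in the explain output:
    // (boost * idf * queryNorm) * (sqrt(freq) * idf * fieldNorm)
    static double clauseScore(double boost, double idf, double queryNorm,
                              double freq, double fieldNorm) {
        double queryWeight = boost * idf * queryNorm;
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double queryNorm = 3.093982E-4; // copied from the explain output

        // contributions:news in doc 63298 (no query boost, fieldNorm 40960)
        double contributions = clauseScore(1.0, 14.530705, queryNorm, 1.0, 40960.0);
        // pg_series_title_ci:news^500 in doc 967895 (fieldNorm 1)
        double seriesTitle = clauseScore(500.0, 6.4641423, queryNorm, 1.0, 1.0);
        System.out.println("contributions      = " + contributions);
        System.out.println("pg_series_title_ci = " + seriesTitle);

        // Same clauses with idf forced to 1. queryNorm would change as well,
        // but it is shared by every clause, so the ordering is unaffected.
        System.out.println("idf=1 contributions      = "
                + clauseScore(1.0, 1.0, queryNorm, 1.0, 40960.0));
        System.out.println("idf=1 pg_series_title_ci = "
                + clauseScore(500.0, 1.0, queryNorm, 1.0, 1.0));
    }
}
```

If this reading of the numbers is right, setting idf to 1 would still leave the contributions clause ahead (40960 versus 500 before the shared queryNorm is applied), because fieldNorm dominates the product. A norm that large may come from an index-time boost; that is a guess, since the thread does not say how the documents were indexed.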
> > > > > >
> > > > > > On Wed, Jan 30, 2013 at 2:06 PM, Sandeep Mestry <sanmes...@gmail.com> wrote:
> > > > > >
> > > > > >> (Sorry for the incomplete reply in my previous mail, didn't know Ctrl F
> > > > > >> sends an email in Gmail.. ;-))
> > > > > >>
> > > > > >> Thanks Felipe, yes I have seen that and my requirement falls under
> > > > > >>
> > > > > >> How can I make exact-case matches score higher
> > > > > >>
> > > > > >> Example: a query of "Penguin" should score documents containing
> > > > > >> "Penguin" higher than docs containing "penguin".
> > > > > >>
> > > > > >> The general strategy is to index the content twice, using different
> > > > > >> fields with different fieldTypes (and different analyzers associated with
> > > > > >> those fieldTypes). One analyzer will contain a lowercase filter for
> > > > > >> case-insensitive matches, and one will preserve case for exact-case
> > > > > >> matches.
> > > > > >>
> > > > > >> Use copyField <http://wiki.apache.org/solr/SchemaXml#copyField> commands
> > > > > >> in the schema to index a single input field multiple times.
> > > > > >>
> > > > > >> Once the content is indexed into multiple fields that are analyzed
> > > > > >> differently, query across both fields
> > > > > >> <http://wiki.apache.org/solr/SolrRelevancyFAQ#multiFieldQuery>.
> > > > > >>
> > > > > >> I have added a case insensitive field too to rank the exact matches
> > > > > >> higher, however the result is not even considering the matches in that
> > > > > >> field - forget the exact matching part.
> > > > > >>
> > > > > >> And I have tried the debugQuery option as mentioned in my previous mail,
> > > > > >> and I have also posted the parsed queries. From the debug query, I see
> > > > > >> that the field boosted with the lesser factor (contribution) is still
> > > > > >> ranking higher than the one with the higher boost factor (series_title).
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Sandeep
> > > > > >>
> > > > > >> On 30 January 2013 16:02, Sandeep Mestry <sanmes...@gmail.com> wrote:
> > > > > >>
> > > > > >> > Thanks Felipe, yes I have seen that and my requirement somewhere falls
> > > > > >> > for
> > > > > >> >
> > > > > >> > On 30 January 2013 15:53, Felipe Lahti <fla...@thoughtworks.com> wrote:
> > > > > >> >
> > > > > >> >> Hi Sandeep,
> > > > > >> >>
> > > > > >> >> Quick answer is that the boost you define in your requestHandler is not
> > > > > >> >> the only thing used to calculate the score of each document. There are
> > > > > >> >> other factors that contribute to the score calculation. You can take a
> > > > > >> >> look at http://wiki.apache.org/solr/SolrRelevancyFAQ. Also, using
> > > > > >> >> debugQuery=true you can see the score calculation for each document
> > > > > >> >> returned.
> > > > > >> >>
> > > > > >> >> Let me know if you need something else.
> > > > > >> >>
> > > > > >> >> On Wed, Jan 30, 2013 at 1:13 PM, Sandeep Mestry <sanmes...@gmail.com>
> > > > > >> >> wrote:
> > > > > >> >>
> > > > > >> >> > Hi All,
> > > > > >> >> >
> > > > > >> >> > I'm facing an issue in relevancy calculation by the dismax query
> > > > > >> >> > parser. The boost factor applied does not work as expected in certain
> > > > > >> >> > cases when the keyword is generic, and by generic I mean the keyword
> > > > > >> >> > appears many times in the document as well as in the index.
> > > > > >> >> >
> > > > > >> >> > I have parser configuration as below:
> > > > > >> >> >
> > > > > >> >> > <requestHandler name="querydismax" class="solr.SearchHandler" >
> > > > > >> >> >   <lst name="defaults">
> > > > > >> >> >     <str name="defType">edismax</str>
> > > > > >> >> >     <str name="echoParams">explicit</str>
> > > > > >> >> >     <float name="tie">0.01</float>
> > > > > >> >> >     <str name="qf">series_title^500 title^100 description^15 contribution</str>
> > > > > >> >> >     <str name="pf">series_title^200</str>
> > > > > >> >> >     <int name="ps">0</int>
> > > > > >> >> >     <str name="q.alt">*:*</str>
> > > > > >> >> >   </lst>
> > > > > >> >> > </requestHandler>
> > > > > >> >> >
> > > > > >> >> > As you can see above, I'd expect the documents containing the matches
> > > > > >> >> > for series title should rank higher than the ones in contribution.
> > > > > >> >> >
> > > > > >> >> > This works well, if I type in a query like 'wonderworld' which is a
> > > > > >> >> > less occurring term and the series titles rank higher. But, if I type
> > > > > >> >> > in a keyword like 'news' which is the most common term in the index, I
> > > > > >> >> > get hits in contributions even though I have lots of documents having
> > > > > >> >> > word news in series title.
> > > > > >> >> >
> > > > > >> >> > The field definition is as below:
> > > > > >> >> >
> > > > > >> >> > <field name="series_title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > > > > >> >> > <field name="title" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > > > > >> >> > <field name="description" type="text_wc" indexed="true" stored="true" multiValued="false" />
> > > > > >> >> > <field name="contribution" type="text" indexed="true" stored="true" multiValued="true" />
> > > > > >> >> >
> > > > > >> >> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100" compressThreshold="10">
> > > > > >> >> >   <analyzer type="index">
> > > > > >> >> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >     <filter class="solr.WordDelimiterFilterFactory"
> > > > > >> >> >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > > > >> >> >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > > > >> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >   </analyzer>
> > > > > >> >> >   <analyzer type="query">
> > > > > >> >> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >     <filter class="solr.WordDelimiterFilterFactory"
> > > > > >> >> >             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > > > > >> >> >             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > > > > >> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >   </analyzer>
> > > > > >> >> > </fieldType>
> > > > > >> >> >
> > > > > >> >> > <fieldType name="text_wc" class="solr.TextField" positionIncrementGap="100" >
> > > > > >> >> >   <analyzer type="index">
> > > > > >> >> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >     <filter class="solr.WordDelimiterFilterFactory"
> > > > > >> >> >             stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
> > > > > >> >> >             catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
> > > > > >> >> >             splitOnNumerics="0" preserveOriginal="1" />
> > > > > >> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >   </analyzer>
> > > > > >> >> >   <analyzer type="query">
> > > > > >> >> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > > >> >> >     <filter class="solr.WordDelimiterFilterFactory"
> > > > > >> >> >             stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1"
> > > > > >> >> >             catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
> > > > > >> >> >             splitOnNumerics="0" preserveOriginal="1" />
> > > > > >> >> >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > > >> >> >   </analyzer>
> > > > > >> >> > </fieldType>
> > > > > >> >> >
> > > > > >> >> > I have tried debugging and when I use query term news, I see that
> > > > > >> >> > matches for contributions are ranked higher than series title.
> > > > > >> >> > The parsed queries look like below:
> > > > > >> >> > (Note that I have edited the query as in reality I have lot of fields
> > > > > >> >> > that are searchable and I have only mentioned the fields containing
> > > > > >> >> > text data - rest all contain uuids)
> > > > > >> >> >
> > > > > >> >> > <str name="parsedquery">
> > > > > >> >> > (+DisjunctionMaxQuery((description:news^15.0 | title:news^100.0 |
> > > > > >> >> > contributions:news | series_title:news^500.0)~0.01) () () () () () () ()
> > > > > >> >> > () () () () () () () () () () () () () () () () () () () () ())/no_coord
> > > > > >> >> > </str>
> > > > > >> >> > <str name="parsedquery_toString">
> > > > > >> >> > +(description:news^15 | title:news^100.0 | contributions:news |
> > > > > >> >> > series_title:news^500.0)~0.01 () () () () () () () () () () () () () ()
> > > > > >> >> > () () () () () () () () () () () () () ()
> > > > > >> >> >
> > > > > >> >> > Could you guide me in right direction please?
> > > > > >> >> >
> > > > > >> >> > Many Thanks,
> > > > > >> >> > Sandeep
> > > > > >> >>
> > > > > >> >> --
> > > > > >> >> Felipe Lahti
> > > > > >> >> Consultant Developer - ThoughtWorks Porto Alegre
> > > > > >
> > > > > > --
> > > > > > Felipe Lahti
> > > > > > Consultant Developer - ThoughtWorks Porto Alegre
> > > >
> > > > --
> > > > Felipe Lahti
> > > > Consultant Developer - ThoughtWorks Porto Alegre
> >
> > --
> > Felipe Lahti
> > Consultant Developer - ThoughtWorks Porto Alegre

--
Felipe Lahti
Consultant Developer - ThoughtWorks Porto Alegre