Dismax and phrases
Hello,

I've inherited a Solr/Lucene project which I continue to develop. This particular Solr (1.4.1) uses dismax for its queries, but I am getting some results that I do not understand. Mainly, when I search for two terms I get some results; however, when I put quotes around the two terms I get a lot more results, which goes against my understanding of what should happen, i.e. a smaller set of results.

Where should I start digging for the answer? solrconfig.xml or some other place?

Best regards,
Lauri Hyttinen
Re: Dismax and phrases
Thank you, Otis, for the answer.

I've played around with the Solr admin query interface and have managed to confuse myself even more. If I query without the quotes, Solr seems to form two parsed queries, +((DisjunctionMaxQuery(( -first word stuff- )) DisjunctionMaxQuery(( -second word stuff- )), and then returns results which contain -both- words. The default operator is OR in schema.xml.

With quotes the query is different, with only one DisjunctionMaxQuery in parsedquery, but the results (of which there are more than double) include pages which contain only one of the words (granted, these results rank much lower than the ones with both words).

I set qs to 0. (I even played with pf and ps before commenting them out, since they relate to automatically generated phrase queries?)

Best regards,
Lauri

PS. I am not unhappy with the results, so to speak, but perplexed, and I don't know how to explain this discrepancy in the numbers to project members other than "Dismax is different."

On 10/19/2011 04:28 PM, Otis Gospodnetic wrote:

Lauri,

Start with adding &debugQuery=true to your URL calls to Solr and look at how the queries are getting rewritten to understand what is going on. What you are seeing is actually expected, so if you want your phrase query to be a strict phrase query, just use the standard request handler, not dismax.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

--
Lauri Hyttinen
Information Service Planner
Tilastokeskus
Unit
Visiting address: Työpajankatu 13, 00580 Helsinki
Postal address: PL 3 A, 00022 Tilastokeskus
Tel. 09 1734
lauri.hytti...@tilastokeskus.fi
www.tilastokeskus.fi
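Otis's suggestion to add debugQuery=true can be tried on both forms of the query side by side. The sketch below only builds the two request URLs so they can be pasted into a browser. The base URL is a hypothetical placeholder, and passing qt=dismax assumes the handler is registered under the name "dismax", as the parameters echoed later in the thread suggest; adjust both to the actual installation.

```python
from urllib.parse import urlencode

# Hypothetical base URL: host, port and path depend on the installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_debug_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL for `query` with debug output switched on."""
    params = {
        "q": query,
        "qt": "dismax",        # assumed handler name, see the note above
        "fl": "*,score",
        "rows": 10,
        "debugQuery": "true",  # the parameter Otis recommends
    }
    return base_url + "?" + urlencode(params)


# The same two words, without and with quotes.
unquoted_url = build_debug_url("asuntojen hinnat")
quoted_url = build_debug_url('"asuntojen hinnat"')

print(unquoted_url)
print(quoted_url)
```

Comparing the parsedquery entries in the two responses shows how each form of the query was rewritten.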
Re: Dismax and phrases
On 10/23/2011 09:34 PM, Erick Erickson wrote:

Hmmm, dismax is, indeed, different. Note that dismax doesn't respect the default operator at all, so don't be misled there. Could you paste the debug output for both the queries? Perhaps something will jump out at us.

Best
Erick

Thank you, Erick. I've tried to paste the query results here. The first one is the query with quotes around the terms, and it returns 6888 results. I've hidden the explain parts of most of the results (and the timing) just to keep the email reasonably short. If you need to see them, let me know. + designates a hidden "subtree".

Best regards,
Lauri

0 91 on standard 2.2 10 *,score on 0 "asuntojen hinnat" dismax
+
asuntojenhinnat
"asuntojen hinnat"
"asuntojen hinnat"

+DisjunctionMaxQuery((table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)

+(table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01 () type:tie^6.0 type:kuv^2.0 type:tau^2.0 (1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0

name="/media/nss/DATA2/data/wwwprod/til/ashi/2011/07/ashi_2011_07_2011-08-26_tie_001_fi.html">
3.1653805 = (MATCH) sum of:
  1.9299976 = (MATCH) max plus 0.01 times others of:
    1.9211313 = weight(title_t:"asuntojen hinnat"^2.0 in 5891), product of:
      0.26658234 = queryWeight(title_t:"asuntojen hinnat"^2.0), product of:
        2.0 = boost
        14.413042 = idf(title_t: asuntojen=250 hinnat=329)
        0.009247955 = queryNorm
      7.206521 = fieldWeight(title_t:"asuntojen hinnat" in 5891), product of:
        1.0 = tf(phraseFreq=1.0)
        14.413042 = idf(title_t: asuntojen=250 hinnat=329)
        0.5 = fieldNorm(field=title_t, doc=5891)
    0.03292808 = (MATCH) sum of:
      0.016520109 = (MATCH) weight(text_fi:asunto in 5891), product of:
        0.044221584 = queryWeight(text_fi:asunto), product of:
          4.781769 = idf(docFreq=3251, maxDocs=142742)
          0.009247955 = queryNorm
        0.3735757 = (MATCH) fieldWeight(text_fi:asunto in 5891), product of:
          1.0 = tf(termFreq(text_fi:asunto)=1)
          4.781769 = idf(docFreq=3251, maxDocs=142742)
          0.078125 = fieldNorm(field=text_fi, doc=5891)
      0.016407972 = (MATCH) weight(text_fi:hinta in 5891), product of:
        0.03705935 = queryWeight(text_fi:hinta), product of:
          4.0073023 = idf(docFreq=7054, maxDocs=142742)
          0.009247955 = queryNorm
        0.44274852 = (MATCH) fieldWeight(text_fi:hinta in 5891), product of:
          1.4142135 = tf(termFreq(text_fi:hinta)=2)
          4.0073023 = idf(docFreq=7054, maxDocs=142742)
          0.078125 = fieldNorm(field=text_fi, doc=5891)
    0.34379265 = (MATCH) sum of:
      0.19207533 = (MATCH) weight(graphic.title_fi:asunto in 5891), product of:
        0.10662244 = queryWeight(graphic.title_fi:asunto), product of:
          5.76465 = idf(docFreq=1216, maxDocs=142742)
          0.01849591 = queryNorm
        1.8014531 = (MATCH) fieldWeight(graphic.title_fi:asunto in 5891), product of:
          1.0 = tf(termFreq(graphic.title_fi:asunto)=1)
          5.76465 = idf(docFreq=1216, maxDocs=142742)
          0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
      0.15171732 = (MATCH) weight(graphic.title_fi:hinta in 5891), product of:
        0.09476117 = queryWeight(graphic.title_fi:hinta), product of:
          5.1233582 = idf(docFreq=2310, maxDocs=142742)
          0.01849591 = queryNorm
        1.6010494 = (MATCH) fieldWeight(graphic.title_fi:hinta in 5891), product of:
          1.0 = tf(termFreq(graphic.title_fi:hinta)=1)
          5.1233582 = idf(docFreq=2310, maxDocs=142742)
          0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
    0.5099132 = (MATCH) sum of:
      0.302103 = (MATCH) weight(title_fi:asunto in 5891), product of:
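The explain output above describes the DisjunctionMaxQuery part of the score as "max plus 0.01 times others". As a quick check of that arithmetic, the sketch below recombines the four per-field sub-scores listed for document 5891 and also recomputes the title_t phrase weight from its boost, idf, queryNorm, tf and fieldNorm factors. The function name dismax_score and the dictionary labels are made up for illustration; every number is copied from the output.

```python
# Tie breaker: the "~0.01" in the parsed query, "0.01 times others" in explain.
TIE = 0.01

# Per-field sub-scores for document 5891, copied from the explain output.
sub_scores = {
    "title_t phrase": 1.9211313,
    "text_fi terms": 0.03292808,
    "graphic.title_fi terms": 0.34379265,
    "title_fi terms": 0.5099132,
}


def dismax_score(scores, tie):
    """Best sub-score plus `tie` times the sum of all the other sub-scores."""
    best = max(scores)
    return best + tie * (sum(scores) - best)


combined = dismax_score(list(sub_scores.values()), TIE)

# The title_t phrase weight is queryWeight * fieldWeight.
query_weight = 2.0 * 14.413042 * 0.009247955  # boost * idf * queryNorm
field_weight = 1.0 * 14.413042 * 0.5          # tf * idf * fieldNorm
phrase_weight = query_weight * field_weight

print(f"dismax part:   {combined:.7f} (explain reports 1.9299976)")
print(f"phrase weight: {phrase_weight:.7f} (explain reports 1.9211313)")
```

Both values agree with the reported ones up to rounding. The phrase match on title_t dominates the score, and the OR-style _fi clauses only contribute through the 0.01 tie breaker, which is consistent with documents matching a single word ranking much lower.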
Re: Dismax and phrases
Hello,

I am starting to wonder whether the module giving Finnish language support (lingsoft) might be the cause? As I said earlier, I have inherited this project, so my understanding of all the bells and whistles is a bit limited.

Some selected parts from the schema.xml file:

... ... multiValued="true" required="false" /> ... multiValued="true"/> ... multiValued="true" /> multiValued="true"/> stored="true" />

Best regards,
Lauri Hyttinen

On 11/03/2011 10:09 PM, Chris Hostetter wrote:

Interesting, in the case where you use quotes...

: + ...
: "asuntojen hinnat"
: "asuntojen hinnat"

...there is one DisjunctionMaxQuery (expected) for the entire phrase, but in the sub-clauses for each individual field the clauses coming from your "_fi" fields are just building boolean "OR" queries of the terms from your phrase (instead of building an actual phrase query)...

: +DisjunctionMaxQuery((table.title_t:"asuntojen
: hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" |
: (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto
: table.description_fi:hinta) | table.description_t:"asuntojen hinnat" |
: graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto
: graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto
: table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" |
: text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) |
: (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto
: title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0
: FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)

...is this perhaps a side effect of the new autoGeneratePhraseQueries option? ... you are explicitly specifying a quoted phrase, but maybe somewhere in the code path of the dismax parser that information is getting lost?

can you post the details of your schema.xml? (ie: the "version" property on the schema file, and the dynamicField/field + fieldType definitions for all these fields)

In contrast, your unquoted example is working exactly as i'd expect. a DisjunctionMaxQuery is built for each clause of the input, and the two DisjunctionMaxQuery objects are then combined in a BooleanQuery where the minNrShouldMatch property is set to "2"

: + ...
: asuntojen hinnat
: asuntojen hinnat
:
: +((DisjunctionMaxQuery((table.title_t:asuntojen^2.0 |
: title_t:asuntojen^2.0 | ingress_t:asuntojen | text_fi:asunto |
: table.description_fi:asunto | table.description_t:asuntojen |
: graphic.title_t:asuntojen^2.0 | graphic.title_fi:asunto^2.0 |
: table.title_fi:asunto^2.0 | table.contents_t:asuntojen | text_t:asuntojen |
: ingress_fi:asunto | table.contents_fi:asunto | title_fi:asunto^2.0)~0.01)
: DisjunctionMaxQuery((table.title_t:hinnat^2.0 | title_t:hinnat^2.0 |
: ingress_t:hinnat | text_fi:hinta | table.description_fi:hinta |
: table.description_t:hinnat | graphic.title_t:hinnat^2.0 |
: graphic.title_fi:hinta^2.0 | table.title_fi:hinta^2.0 |
: table.contents_t:hinnat | text_t:hinnat | ingress_fi:hinta |
: table.contents_fi:hinta | title_fi:hinta^2.0)~0.01))~2) () type:tie^6.0
: type:kuv^2.0 type:tau^2.0
: FunctionQuery((1.0/(3.16E-11*float(ms(const(1319438484878),date(date.modified_dt)))+1.0))^100.0)

-Hoss
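To see why the two parsed queries Hoss describes produce the counts Lauri reports, it can help to model them on a handful of documents. The sketch below is a toy model and not Solr code: it keeps only one _t field and one _fi field, the four documents and the tokens "ja" and "vuokrat" are made up, and the helper names are hypothetical. The quoted query is modelled as a single clause in which the _t field needs the exact phrase while the _fi field accepts either stemmed term; the unquoted query is modelled as one clause per word, both of which must match (the "~2").

```python
def has_phrase(tokens, phrase):
    """True if `phrase` occurs as a run of adjacent tokens in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def quoted_match(doc, surface, stems):
    """One dismax clause: phrase on text_t, OR of stemmed terms on text_fi."""
    phrase_hit = has_phrase(doc["text_t"], surface)
    or_hit = any(stem in doc["text_fi"] for stem in stems)
    return phrase_hit or or_hit


def unquoted_match(doc, surface, stems):
    """One dismax clause per word, and every clause has to match."""
    return all(
        word in doc["text_t"] or stem in doc["text_fi"]
        for word, stem in zip(surface, stems)
    )


# Made-up documents; text_fi holds the stemmed form of text_t.
docs = {
    "adjacent": {"text_t": ["asuntojen", "hinnat"],
                 "text_fi": ["asunto", "hinta"]},
    "apart": {"text_t": ["hinnat", "ja", "asuntojen", "vuokrat"],
              "text_fi": ["hinta", "ja", "asunto", "vuokrat"]},
    "one_word": {"text_t": ["asuntojen", "vuokrat"],
                 "text_fi": ["asunto", "vuokrat"]},
    "neither": {"text_t": ["vuokrat"], "text_fi": ["vuokrat"]},
}

surface = ["asuntojen", "hinnat"]  # words as typed
stems = ["asunto", "hinta"]        # stemmed forms seen in the parsed query

quoted_hits = [name for name, d in docs.items() if quoted_match(d, surface, stems)]
unquoted_hits = [name for name, d in docs.items() if unquoted_match(d, surface, stems)]

print("quoted:  ", quoted_hits)
print("unquoted:", unquoted_hits)
```

In this model the quoted query also returns the document that contains only one of the words, so it matches a superset of what the unquoted query matches, which mirrors the larger result count seen with quotes.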