Dismax and phrases

2011-10-19 Thread Hyttinen Lauri

Hello,

I've inherited a Solr/Lucene project which I continue to develop. This 
particular Solr (1.4.1) uses dismax for the queries, but I am getting 
some results that I do not understand. When I search for two terms I 
get some results; however, when I put quotes around the two terms I get 
a lot more results. This goes against my understanding of what should 
happen, i.e. a smaller set of results. Where should I start digging 
for the answer? solrconfig.xml or some other place?


Best regards,
Lauri Hyttinen


Re: Dismax and phrases

2011-10-20 Thread Hyttinen Lauri

Thank you Otis for the answer.

I've played around with the solr admin query interface and I've managed 
to confuse myself even more.

If I query without the quotes, Solr seems to form two DisjunctionMaxQuery 
clauses in the parsedquery:

+((DisjunctionMaxQuery(( -first word stuff- )) DisjunctionMaxQuery(( 
-second word stuff- ))


and then, based on that query, returns results which have -both- words. 
The default operator is OR in schema.xml.


With quotes the query is different, with only one DisjunctionMaxQuery in 
the parsedquery, but the results (of which there are more than double) 
include pages which have only one of the words (granted, these results 
rank much lower than the ones with both words).


I set qs to 0. (And I even played with pf and ps before commenting them 
out, since they relate to automatically generated phrase queries?)


Best regards,
Lauri

PS. I am not unhappy with the results so to speak but perplexed and 
don't know how to explain this number discrepancy to project members 
other than

"Dismax is different."


On 10/19/2011 04:28 PM, Otis Gospodnetic wrote:

Lauri,

Start by adding &debugQuery=true to your URL calls to Solr and look at how the 
queries are getting rewritten to understand what is going on.  What you are seeing 
is actually expected, so if you want your phrase query to be a strict phrase query, 
just use the standard request handler, not dismax.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
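
A minimal sketch of the debugQuery suggestion above, written in Python. The host, port, and handler path are hypothetical, and selecting dismax with defType is an assumption; a setup that registers a request handler named dismax would pass qt=dismax instead.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; replace with your own host, port, and path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_debug_url(query, use_dismax=True):
    """Return a select URL that asks Solr to include debug output."""
    params = {
        "q": query,
        "fl": "*,score",
        "rows": 10,
        "debugQuery": "true",  # adds parsedquery and explain sections
    }
    if use_dismax:
        # Assumption: the dismax parser is selected with defType.
        params["defType"] = "dismax"
    return SOLR_SELECT_URL + "?" + urlencode(params)


unquoted_url = build_debug_url("asuntojen hinnat")
quoted_url = build_debug_url('"asuntojen hinnat"')
print(unquoted_url)
print(quoted_url)
```

Requesting both URLs and comparing the parsedquery entries in the two responses shows how each form of the query was rewritten.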









--
Lauri Hyttinen
Tietopalvelusuunnittelija
Tilastokeskus
Yksikkö
Käyntiosoite: Työpajankatu 13, 00580 Helsinki
Postiosoite: PL 3 A, 00022 Tilastokeskus
puh. 09 1734 
lauri.hytti...@tilastokeskus.fi
www.tilastokeskus.fi



Re: Dismax and phrases

2011-10-23 Thread Hyttinen Lauri

On 10/23/2011 09:34 PM, Erick Erickson wrote:

Hmmm, dismax is, indeed, different. Note that dismax doesn't respect
the default operator at all, so don't be misled there.

Could you paste the debug output for both the queries? Perhaps something
will jump out at us.

Best
Erick

Thank you Erick. I've tried to paste the query results here.
The first one is the query with quotes ("") around the terms; it returns 6888 results.
I've hidden the explain parts of most of the results (and the timing) just to 
keep the email reasonably short.

If you need to see them, let me know.
"+" designates a hidden subtree.

Best regards,
Lauri



0
91



on

standard
2.2
10
*,score
on
0
"asuntojen hinnat"
dismax





+



asuntojenhinnat

"asuntojen hinnat"
"asuntojen hinnat"

+DisjunctionMaxQuery((table.title_t:"asuntojen 
hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen 
hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto 
table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | 
graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto 
graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto 
table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | 
text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | 
(table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto 
title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 
FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)


+(table.title_t:"asuntojen hinnat"^2.0 
| title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | 
(text_fi:asunto text_fi:hinta) | (table.description_fi:asunto 
table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | 
graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto 
graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto 
table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | 
text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | 
(table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto 
title_fi:hinta)^2.0))~0.01 () type:tie^6.0 type:kuv^2.0 type:tau^2.0 
(1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0



name="/media/nss/DATA2/data/wwwprod/til/ashi/2011/07/ashi_2011_07_2011-08-26_tie_001_fi.html">

3.1653805 = (MATCH) sum of:
  1.9299976 = (MATCH) max plus 0.01 times others of:
    1.9211313 = weight(title_t:"asuntojen hinnat"^2.0 in 5891), product of:
      0.26658234 = queryWeight(title_t:"asuntojen hinnat"^2.0), product of:
        2.0 = boost
        14.413042 = idf(title_t: asuntojen=250 hinnat=329)
        0.009247955 = queryNorm
      7.206521 = fieldWeight(title_t:"asuntojen hinnat" in 5891), product of:
        1.0 = tf(phraseFreq=1.0)
        14.413042 = idf(title_t: asuntojen=250 hinnat=329)
        0.5 = fieldNorm(field=title_t, doc=5891)
    0.03292808 = (MATCH) sum of:
      0.016520109 = (MATCH) weight(text_fi:asunto in 5891), product of:
        0.044221584 = queryWeight(text_fi:asunto), product of:
          4.781769 = idf(docFreq=3251, maxDocs=142742)
          0.009247955 = queryNorm
        0.3735757 = (MATCH) fieldWeight(text_fi:asunto in 5891), product of:
          1.0 = tf(termFreq(text_fi:asunto)=1)
          4.781769 = idf(docFreq=3251, maxDocs=142742)
          0.078125 = fieldNorm(field=text_fi, doc=5891)
      0.016407972 = (MATCH) weight(text_fi:hinta in 5891), product of:
        0.03705935 = queryWeight(text_fi:hinta), product of:
          4.0073023 = idf(docFreq=7054, maxDocs=142742)
          0.009247955 = queryNorm
        0.44274852 = (MATCH) fieldWeight(text_fi:hinta in 5891), product of:
          1.4142135 = tf(termFreq(text_fi:hinta)=2)
          4.0073023 = idf(docFreq=7054, maxDocs=142742)
          0.078125 = fieldNorm(field=text_fi, doc=5891)
    0.34379265 = (MATCH) sum of:
      0.19207533 = (MATCH) weight(graphic.title_fi:asunto in 5891), product of:
        0.10662244 = queryWeight(graphic.title_fi:asunto), product of:
          5.76465 = idf(docFreq=1216, maxDocs=142742)
          0.01849591 = queryNorm
        1.8014531 = (MATCH) fieldWeight(graphic.title_fi:asunto in 5891), product of:
          1.0 = tf(termFreq(graphic.title_fi:asunto)=1)
          5.76465 = idf(docFreq=1216, maxDocs=142742)
          0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
      0.15171732 = (MATCH) weight(graphic.title_fi:hinta in 5891), product of:
        0.09476117 = queryWeight(graphic.title_fi:hinta), product of:
          5.1233582 = idf(docFreq=2310, maxDocs=142742)
          0.01849591 = queryNorm
        1.6010494 = (MATCH) fieldWeight(graphic.title_fi:hinta in 5891), product of:
          1.0 = tf(termFreq(graphic.title_fi:hinta)=1)
          5.1233582 = idf(docFreq=2310, maxDocs=142742)
          0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
    0.5099132 = (MATCH) sum of:
      0.302103 = (MATCH) weight(title_fi:asunto in 5891), product of:
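
The "max plus 0.01 times others" line in the explain output above can be checked by hand. The following Python sketch recomputes it from the four per-field scores visible in the paste for doc 5891. The function names are made up for illustration, and the formulas are only a reading of the numbers shown, not Solr's own code.

```python
def dismax_score(clause_scores, tie):
    """Best clause score plus tie times the sum of the other clauses."""
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


def phrase_weight(boost, idf, query_norm, tf, field_norm):
    """queryWeight times fieldWeight, as listed in the explain tree."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


# Per-field scores copied from the explain output for doc 5891:
# the title_t phrase match and the three "_fi" sums.
field_scores = [1.9211313, 0.03292808, 0.34379265, 0.5099132]
score = dismax_score(field_scores, tie=0.01)

# The title_t phrase clause, rebuilt from its listed factors.
title_t = phrase_weight(boost=2.0, idf=14.413042, query_norm=0.009247955,
                        tf=1.0, field_norm=0.5)

print(score)    # compare with the reported 1.9299976
print(title_t)  # compare with the reported 1.9211313
```

The strict phrase match on title_t supplies almost the whole score; the "_fi" clauses, which match on individual terms, contribute only through the 0.01 tie factor.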
 

Re: Dismax and phrases

2011-11-06 Thread Hyttinen Lauri

Hello,

I am starting to wonder whether the module providing Finnish language 
support (Lingsoft) might be the cause?
As I said earlier, I have inherited this project, so my understanding of 
all the bells and whistles is a bit limited.


Some selected parts from the schema.xml file:


...

...
multiValued="true" required="false" />

...
multiValued="true"/>

...

multiValued="true" />

multiValued="true"/>

stored="true" />


Best regards,
Lauri Hyttinen


On 11/03/2011 10:09 PM, Chris Hostetter wrote:

Interesting, in the case where you use quotes...

: +
...
:"asuntojen hinnat"
:"asuntojen hinnat"

...there is one DisjunctionMaxQuery (expected) for the entire phrase,
but in the sub-clauses for each individual field the clauses coming from
your "_fi" fields are just building boolean "OR" queries of the terms from
your phrase (instead of building an actual phrase query)...

:+DisjunctionMaxQuery((table.title_t:"asuntojen
: hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" |
: (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto
: table.description_fi:hinta) | table.description_t:"asuntojen hinnat" |
: graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto
: graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto
: table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" |
: text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) |
: (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto
: title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0
: FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)

...is this perhaps a side effect of the new autoGeneratePhraseQueries
option? ... you are explicitly specifying a quoted phrase, but
maybe somewhere in the code path of the dismax parser that information is
getting lost?

can you post the details of your schema.xml?  (ie: the "version" property
on the schema file, and the dynamicField/field + fieldType definitions for
all these fields)

In contrast, your unquoted example is working exactly as i'd expect.  a
DisjunctionMaxQuery is built for each clause of the input, and the two
DisjunctionMaxQuery objects are then combined in a BooleanQuery where the
minNrShouldMatch property is set to "2"

: +
...
:asuntojen hinnat
:asuntojen hinnat
:
:+((DisjunctionMaxQuery((table.title_t:asuntojen^2.0 |
: title_t:asuntojen^2.0 | ingress_t:asuntojen | text_fi:asunto |
: table.description_fi:asunto | table.description_t:asuntojen |
: graphic.title_t:asuntojen^2.0 | graphic.title_fi:asunto^2.0 |
: table.title_fi:asunto^2.0 | table.contents_t:asuntojen | text_t:asuntojen |
: ingress_fi:asunto | table.contents_fi:asunto | title_fi:asunto^2.0)~0.01)
: DisjunctionMaxQuery((table.title_t:hinnat^2.0 | title_t:hinnat^2.0 |
: ingress_t:hinnat | text_fi:hinta | table.description_fi:hinta |
: table.description_t:hinnat | graphic.title_t:hinnat^2.0 |
: graphic.title_fi:hinta^2.0 | table.title_fi:hinta^2.0 |
: table.contents_t:hinnat | text_t:hinnat | ingress_fi:hinta |
: table.contents_fi:hinta | title_fi:hinta^2.0)~0.01))~2) () type:tie^6.0
: type:kuv^2.0 type:tau^2.0
: FunctionQuery((1.0/(3.16E-11*float(ms(const(1319438484878),date(date.modified_dt)))+1.0))^100.0)
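
To make the difference between the two parsed queries concrete, here is a toy model in Python. It is not Solr code: the three documents are invented, only the text_t and text_fi fields are modelled, and the matching rules are simply read off the parsedquery strings quoted above (both clauses required for the unquoted query, a strict phrase on "_t" but a plain OR of stems on "_fi" for the quoted one).

```python
# Invented sample index: field name -> analyzed tokens, in order.
docs = {
    "both_adjacent": {"text_t": ["asuntojen", "hinnat"],
                      "text_fi": ["asunto", "hinta"]},
    "both_apart": {"text_t": ["asuntojen", "uudet", "hinnat"],
                   "text_fi": ["asunto", "uusi", "hinta"]},
    "only_one": {"text_t": ["hinnat", "nousivat"],
                 "text_fi": ["hinta", "nousta"]},
}


def has_phrase(tokens, phrase):
    """True if the phrase tokens occur consecutively in tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase
               for i in range(len(tokens) - n + 1))


def matches_unquoted(doc):
    """One dismax clause per word; both clauses must match (the ~2)."""
    first = "asuntojen" in doc["text_t"] or "asunto" in doc["text_fi"]
    second = "hinnat" in doc["text_t"] or "hinta" in doc["text_fi"]
    return first and second


def matches_quoted(doc):
    """A single dismax clause: phrase on _t, OR of the stems on _fi."""
    phrase_t = has_phrase(doc["text_t"], ["asuntojen", "hinnat"])
    either_fi = "asunto" in doc["text_fi"] or "hinta" in doc["text_fi"]
    return phrase_t or either_fi


unquoted_hits = sorted(n for n, d in docs.items() if matches_unquoted(d))
quoted_hits = sorted(n for n, d in docs.items() if matches_quoted(d))
print(unquoted_hits)
print(quoted_hits)
```

In this model the quoted query returns a superset of the unquoted hits, because a document containing only one of the stems in a "_fi" field still satisfies the single remaining clause.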


-Hoss



