Yeah, I see your points. It's complicated. I'm not sure either.
But the thing is:
> in order to use a feature like that you'd have to really think hard about
> the query analysis of your fields, and which ones will produce which
> tokens in which situations
You need to think really hard about the (index and query) analysis of
your fields and which ones will produce which tokens _now_, if you are
using multiple fields in a 'qf' with differing analysis, and using a
percent mm (or, similarly, an mm that varies depending on how many terms
there are). That's what I've come to realize: that's the status quo. If
your qf fields don't all have identical analysis, right _now_ you need to
think really hard about the analysis and how it's possibly going to affect
'mm', including for edge case queries. If you don't, you likely have at
least some edge case queries that aren't behaving how you expected
(whether you notice, or have it brought to your attention by users, or not).
Or you can just make sure all fields in your qf have identical analysis,
and then you don't have to worry about it. But that's not always
practical; a lot of the power of dismax qf comes from combining
fields with different analysis.
So I was trying to think of a way to make this less of a problem while
still being able to take advantage of dismax, but I think you're right
that maybe there isn't one, or at least nothing we've come up with yet.
Maybe what I really need is a query parser that does not do "disjunction
maximum" at all, but somehow still combines different 'qf' type fields
with different boosts on each field. I personally don't _necessarily_
need the actual "disjunction max" calculation, but I do need to combine
multiple fields with different boosts. Of course, I'm not sure exactly
how it would combine multiple fields if not "disjunction maximum", but
perhaps one is conceivable that wouldn't be subject to this particular
gotcha with differing analysis.
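For example (purely hypothetical, I haven't tried this), a parser could
just sum a boosted query per field for each word instead of taking the
max -- though I realize that by itself wouldn't change the clause-counting
issue. Roughly, in Lucene 3.x terms:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SumAcrossFieldsSketch {
    // hypothetical: one SHOULD clause per (word, field) pair, with a
    // per-field boost; scores get summed rather than max'd
    public static Query buildWordClause() {
        TermQuery inTitle = new TermQuery(new Term("title", "foo"));
        inTitle.setBoost(5.0f);  // made-up boost, stands in for qf=title^5
        TermQuery inText = new TermQuery(new Term("text", "foo"));

        BooleanQuery word = new BooleanQuery();
        word.add(inTitle, BooleanClause.Occur.SHOULD);
        word.add(inText, BooleanClause.Occur.SHOULD);
        return word;
    }
}

(Though I think setting tie=1.0 on dismax already gives you more or less
that summing behavior.)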
I also remain kind of confused about how the existing dismax figures out
"how many terms" for the 'mm' type calculations. If someone wanted to
explain that, I would find it enlightening and helpful for
understanding what's going on.
Jonathan
On 6/21/2011 10:20 PM, Chris Hostetter wrote:
: not other) setups/intentions. It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
agreed .. but that's where looking at the debug info comes in: it helps
you see that the reason for that behavior is that your old qf treated part
of your input as garbage, and that new field respects it and uses it in
the calculation.
mind you: the "fewer hits" behavior only happens when using a percentage
value in mm ... if you had mm=2 you'd get more results, but you've asked
for "66%" (or whatever) and with that new qf there is a differnet number
of clauses produced by query parsing.
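to put some made up numbers on it (and assuming i'm remembering the mm
spec right, that percentages get rounded down):

  old qf: the input parses into 3 clauses ... 66% of 3 = 1.98 -> 1 must match
  new qf: the same input parses into 5 clauses, because the new field keeps
          2 chunks the other fields threw away ... 66% of 5 = 3.3 -> 3 must match

so a doc that only matched one or two of the original clauses shows up
with the old qf but not with the new one.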
: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf? I am still not sure how feasible
even in your use case, i don't think you are fully considering what that
would produce. imagine that an mmType=min param existed and gave you what
you're asking for. Now imagine that you have two fields, one named
"simple" that strips all punctuation and one named "complex" that doesn't,
and you have a query like this...
q=Foo & Bar
qf=simple complex
mm=100%
mmType=min
* Foo produces tokens for all qf
* & only produces tokens for some qf (complex)
* Bar produces tokens for all qf
your mmType would say "there are only 2 tokens that we can query across
all fields, so our computed minShouldMatch should be 100% of 2 == 2"
sounds good so far, right?
the problem is you still have a query clause coming from that "&"
character ... you have 3 real clauses, one of which is that term query for
"complex:&" which means that with your (computed) minShouldMatch of 2 you
would see matches for any doc that happened to have indexed the "&" symbol
in the "complex" field and also matched *either* of Foo or Bar (in either
field)
So while a lot of your results would match both Foo and Bar, you'd still
get a bunch of weird results.
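to make the shape of that concrete, here's a rough sketch (lucene 3.x
style API, not the actual dismax parser code) of what that parsed query
boils down to, with tie breakers and boosts left out:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MmMinSketch {
    public static Query build() {
        // one dismax clause per chunk of the input, spanning whichever
        // qf fields produced a token for that chunk (post analysis,
        // hence the lowercasing)
        DisjunctionMaxQuery foo = new DisjunctionMaxQuery(0.0f);
        foo.add(new TermQuery(new Term("simple", "foo")));
        foo.add(new TermQuery(new Term("complex", "foo")));

        DisjunctionMaxQuery amp = new DisjunctionMaxQuery(0.0f);
        amp.add(new TermQuery(new Term("complex", "&"))); // only "complex" keeps this token

        DisjunctionMaxQuery bar = new DisjunctionMaxQuery(0.0f);
        bar.add(new TermQuery(new Term("simple", "bar")));
        bar.add(new TermQuery(new Term("complex", "bar")));

        BooleanQuery main = new BooleanQuery();
        main.add(foo, BooleanClause.Occur.SHOULD);
        main.add(amp, BooleanClause.Occur.SHOULD);
        main.add(bar, BooleanClause.Occur.SHOULD);
        // the hypothetical mmType=min computed 100% of 2 == 2
        main.setMinimumNumberShouldMatch(2);
        return main;
    }
}

any doc with "&" indexed in "complex" plus either foo or bar (in either
field) satisfies 2 of those 3 SHOULD clauses.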
: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,
Hmmm.... maybe. i'd have to see a patch in action and play with it, to
really think it through ... hmmm ... honestly i really can't imagine how
that would be helpful in general...
in order to use a feature like that you'd have to really think hard about
the query analysis of your fields, and which ones will produce which
tokens in which situations in order to make sure you pick the *right*
value for that param -- but once you've done that hard thinking you might
as well feed it back into your schema.xml and say "the query analyzer for
field 'complex' should prune any tokens that only contain punctuation"
(instead of saying "'complex' will produce tokens that only contain
punctuation, so let's tell dismax to compute mm based only on 'simple'").
After all, there might not be one single field that you can pick -- maybe
'complex' lets tokens that are all punctuation through but strips
stopwords, and maybe 'simple' does the opposite ... no param value you
pick will help you with that possibility; you really just need to fix the
query analyzers to make sense if you want to use both of those fields
in the qf.
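for the punctuation-only case, something like this in the query analyzer
(untested, just sketching from memory of the stock filter factories) would
probably do it -- blank out punctuation-only tokens, then drop the
now-empty ones:

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+$" replacement="" replace="all"/>
  <filter class="solr.LengthFilterFactory" min="1" max="512"/>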
-Hoss