Yeah, I see your points. It's complicated. I'm not sure either.
But the thing is:
> in order to use a feature like that you'd have to really think hard about
> the query analysis of your fields, and which ones will produce which
> tokens in which situations
You need to think really hard about the (index and query) analysis of
your fields and which ones will produce which tokens _now_, if you are
using multiple fields in a 'qf' with differing analysis, and using a
percent mm (or, similarly, an mm that varies depending on how many terms
there are). That's what I've come to realize: that's the status quo. If
your qf fields don't all have identical analysis, right _now_ you need to
think really hard about the analysis and how it's possibly going to affect
'mm', including for edge case queries. If you don't, you likely have at
least some edge case queries that aren't behaving how you expected
(whether you notice, or have it brought to your attention by users, or not).
Or you can just make sure all fields in your qf have identical analysis,
and then you don't have to worry about it. But that's not always
practical; a lot of the power of dismax qf comes from combining
fields with different analysis.
So I was trying to think of a way to make this less of a problem while
still being able to take advantage of dismax, but I think you're right
that maybe there isn't one, or at least nothing we've come up with yet.
Maybe what I really need is a query parser that does not do "disjunction
maximum" at all, but somehow still combines different 'qf' type fields
with different boosts on each field. I personally don't _necessarily_
need the actual "disjunction max" calculation, but I do need to combine
multiple fields with different boosts. Of course, I'm not sure exactly
how it would combine multiple fields if not "disjunction maximum", but
perhaps one is conceivable that wouldn't be subject to this particular
gotcha with differing analysis.
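For example (purely hypothetical, I haven't tried this), a parser could
just sum a boosted query per field for each word instead of taking the
max -- though I realize that by itself wouldn't change the clause-counting
issue. Roughly, in Lucene 3.x terms:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SumAcrossFieldsSketch {
    // hypothetical: one SHOULD clause per (word, field) pair, with a
    // per-field boost; scores get summed rather than max'd
    public static Query buildWordClause() {
        TermQuery inTitle = new TermQuery(new Term("title", "foo"));
        inTitle.setBoost(5.0f);  // made-up boost, stands in for qf=title^5
        TermQuery inText = new TermQuery(new Term("text", "foo"));

        BooleanQuery word = new BooleanQuery();
        word.add(inTitle, BooleanClause.Occur.SHOULD);
        word.add(inText, BooleanClause.Occur.SHOULD);
        return word;
    }
}

(Though I think setting tie=1.0 on dismax already gives you more or less
that summing behavior.)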
I also remain kind of confused about how the existing dismax figures out
"how many terms" for the 'mm' type calculations. If someone wanted to
explain that, I would find it enlightening and helpful for
understanding what's going on.
Jonathan
On 6/21/2011 10:20 PM, Chris Hostetter wrote:
: not other) setups/intentions. It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set
agreed .. but that's where looking at the debug info comes in: it helps
you see that the reason for that behavior is that your old qf treated part
of your input as garbage, and that new field respects it and uses it in
the calculation.
mind you: the "fewer hits" behavior only happens when using a percentage
value in mm ... if you had mm=2 you'd get more results, but you've asked
for "66%" (or whatever) and with that new qf there is a differnet number
of clauses produced by query parsing.
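to put some made up numbers on it (and assuming i'm remembering the mm
spec right, that percentages get rounded down):

  old qf: the input parses into 3 clauses ... 66% of 3 = 1.98 -> 1 must match
  new qf: the same input parses into 5 clauses, because the new field keeps
          2 chunks the other fields threw away ... 66% of 5 = 3.3 -> 3 must match

so a doc that only matched one or two of the original clauses shows up
with the old qf but not with the new one.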
: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf? I am still not sure how feasible
even in your use case, i don't think you are fully considering what that
would produce. imagine that an mmType=min param existed and gave you what
you're asking for. Now imagine that you have two fields, one named
"simple" that strips all punctuation and one named "complex" that doesn't,
and you have a query like this...
q=Foo & Bar
qf=simple complex
mm=100%
mmType=min
* Foo produces tokens for all qf
* & only produces tokens for some qf (complex)
* Bar produces tokens for all qf
your mmType would say "there are only 2 tokens that we can query across
all fields, so our computed minShouldMatch should be 100% of 2 == 2"
sounds good so far, right?
the problem is you still have a query clause coming from that "&"
character ... you have 3 real clauses, one of which is that term query for
"complex:&" which means that with your (computed) minShouldMatch of 2 you
would see matches for any doc that happened to have indexed the "&" symbol
in the "complex" field and also matched *either* of Foo or Bar (in either
field)
So while a lot of your results would match both Foo and Bar, you'd still
get a bunch of weird results.
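to make the shape of that concrete, here's a rough sketch (lucene 3.x
style API, not the actual dismax parser code) of what that parsed query
boils down to, with tie breakers and boosts left out:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MmMinSketch {
    public static Query build() {
        // one dismax clause per chunk of the input, spanning whichever
        // qf fields produced a token for that chunk (post analysis,
        // hence the lowercasing)
        DisjunctionMaxQuery foo = new DisjunctionMaxQuery(0.0f);
        foo.add(new TermQuery(new Term("simple", "foo")));
        foo.add(new TermQuery(new Term("complex", "foo")));

        DisjunctionMaxQuery amp = new DisjunctionMaxQuery(0.0f);
        amp.add(new TermQuery(new Term("complex", "&"))); // only "complex" keeps this token

        DisjunctionMaxQuery bar = new DisjunctionMaxQuery(0.0f);
        bar.add(new TermQuery(new Term("simple", "bar")));
        bar.add(new TermQuery(new Term("complex", "bar")));

        BooleanQuery main = new BooleanQuery();
        main.add(foo, BooleanClause.Occur.SHOULD);
        main.add(amp, BooleanClause.Occur.SHOULD);
        main.add(bar, BooleanClause.Occur.SHOULD);
        // the hypothetical mmType=min computed 100% of 2 == 2
        main.setMinimumNumberShouldMatch(2);
        return main;
    }
}

any doc with "&" indexed in "complex" plus either foo or bar (in either
field) satisfies 2 of those 3 SHOULD clauses.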
: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,
Hmmm.... maybe. i'd have to see a patch in action and play with it, to
really think it through ... hmmm ... honestly i really can't imagine how
that would be helpful in general...
in order to use a feature like that you'd have to really think hard about
the query analysis of your fields, and which ones will produce which
tokens in which situations in order to make sure you pick the *right*
value for that param -- but once you've done that hard thinking you might
as well feed it back into your schema.xml and say "the query analyzer for
field 'complex' should prune any tokens that only contain punctuation"
(instead of saying "'complex' will produce tokens that only contain
punctuation, so let's tell dismax to compute mm based only on 'simple'").
After all, there might not be one single field that you can pick -- maybe
'complex' lets tokens that are all punctuation through but strips
stopwords, and maybe 'simple' does the opposite ... no param value you
pick will help you with that possibility; you really just need to fix the
query analyzers to make sense if you want to use both of those fields
in the qf.
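for the punctuation-only case, something like this in the query analyzer
(untested, just sketching from memory of the stock filter factories) would
probably do it -- blank out punctuation-only tokens, then drop the
now-empty ones:

  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^\p{Punct}+$" replacement="" replace="all"/>
  <filter class="solr.LengthFilterFactory" min="1" max="512"/>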
-Hoss