scroll a bit...

Chuck Williams wrote:

David Spencer wrote:
> [1]
> I currently have 2 variations on the index, one w/ the default settings
> and another with the Similarity code Chuck attached to the bug report.
> Do we need other variations on the index e.g. with different weights, or
> during indexing are the weights less important than the log() vs.
> sqrt(log()) issue?


My Similarity eliminates the idf^2 by using sqrt(log()), changes the
base of the logarithm for flattening tf and idf from e to 10 (or any
parameter setting at runtime), changes the lengthNorm flattening from
sqrt to log base-10 (not settable at runtime), and adds 1000 to all
field lengths (normalizing this re. the log base-10 by changing the
numerator from 1 to 3 = log10(1000)).

The net effects are to increase flattening of tf and idf by a constant,
increase flattening of lengthNorm fundamentally (sqrt to log), and
eliminate large lengthNorm effects with very small fields (further
flattening its effect).
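
To make the shapes concrete, here is a rough standalone Java sketch of the idf and lengthNorm pieces described above. It is not the Similarity attached to the bug report. The "default" methods follow Lucene's DefaultSimilarity formulas; the "flattened" idf is an assumed form (square root of the same log expression, with a configurable base), and only the lengthNorm shape 3 / log10(1000 + length) comes directly from the description. The tf change is left out because its exact form is not stated here. The class and method names are made up for illustration.

```java
// Sketch only: plain formulas for comparison, with no Lucene dependency.
final class SimilaritySketch {

    // Logarithm of x in an arbitrary base.
    static double logBase(double x, double base) {
        return Math.log(x) / Math.log(base);
    }

    // DefaultSimilarity-style idf: ln(numDocs / (docFreq + 1)) + 1.
    static double defaultIdf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // ASSUMED flattened idf: sqrt of the same expression in a configurable base.
    // Squaring it gives back a single log term, which is the point of the sqrt.
    static double flattenedIdf(int docFreq, int numDocs, double base) {
        double inner = logBase(numDocs / (double) (docFreq + 1), base) + 1.0;
        return Math.sqrt(Math.max(0.0, inner)); // guard against tiny negatives
    }

    // DefaultSimilarity-style lengthNorm: 1 / sqrt(numTerms).
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Described lengthNorm: add 1000 to the length, use log base-10,
    // and use 3 = log10(1000) as the numerator so an empty field scores 1.
    static double flattenedLengthNorm(int numTerms) {
        return 3.0 / Math.log10(1000.0 + numTerms);
    }

    public static void main(String[] args) {
        int[] lengths = {1, 10, 100, 1000, 10000};
        for (int n : lengths) {
            System.out.printf("len=%d default=%.4f flattened=%.4f%n",
                    n, defaultLengthNorm(n), flattenedLengthNorm(n));
        }
        System.out.printf("idf default=%.4f flattened(base 10)=%.4f%n",
                defaultIdf(9, 1000000), flattenedIdf(9, 1000000, 10.0));
    }
}
```

Running main prints both lengthNorm curves for a few field lengths, which shows how little the flattened version separates a 1-term field from a 100-term one compared with the sqrt form.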

At least in the case of multiple fields with meaningful field-boosts,
I've found these all improve relevance (i.e., in my app).  I found and
made the changes 1-at-a-time based on analyzing explain()'s with result
lists my app produces.

Re. this analysis, any sequencing of considering the different changes
is fine with me, although once again, I don't think these are completely
orthogonal considerations.  The combination of Similarity tuning
decisions has impact above-and-beyond the individual effects.

> [2]
> I guess it's obvious from the above, but just to make it clear - I'll
> change the page to only do single field queries - but how many
> variations do we want to see in parallel - the current page shows 2x2
> results, for each combo of index and query - but I, say, show several
> more queries in parallel w/ different weights...
>


I'd like to keep the current multi-field results as there hasn't been
much analysis of this yet.

Re. other scenarios, I think we should look at:
  1.  Current QueryParser and DefaultSimilarity with single field and
      Default-OR.
  2.  Above with Default-AND.
  3.  My Similarity (or subset thereof) and current QueryParser with
      Default-OR.
  4.  Same as 3, with Default-AND.


Consideration of proximity solutions (e.g., Doug's DensityQuery for Default-AND, and what I'm proposing for Default-OR) should be separate.


Sorry for the delay in getting back to this thread - hope I found the right place to put the reply.

I did another page (wikipedia-similarity1.jsp) which is like the earlier experiment in that it has 2 versions of the Wikipedia index, one with the default Lucene Similarity, and one with Chuck's proposal ( http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 ).

If you want to skip the explanation just click here for an example results page and play around:

http://www.searchmorph.com/kat/wikipedia-similarity1.jsp?s=information+retrieval+search+engine&goal=10&tfLogBase=2.3026&idfLogBase=2.3026&phraseBoost=2.0000&slop=9999&qp=information+retrieval+search+engine


The difference is that this new page only does single-field queries and does a lot(!) of them. Please don't make any human factors judgments on the page, or submit it to Tufte as an example of how not to present information :)


So - there are 2 indexes and 9 (!) query parsers used for every query, and 18 (2 * 9) searches performed.

Queries are:

q1: MultiFieldQueryParser with OR semantics
q2: Same, with AND

q3: DistributingMultiFieldQueryParser with OR
q4: Holding place for DistributingMultiFieldQueryParser with AND (Chuck?)

q5: Code of mine that does a simple OR, so "a b" => (Query) "a b"
q6: Code of mine that does a simple AND so "a b" => +a +b

q7: Code of mine based on Doug's suggestion somewhere else in this thread, like q5 but tosses in a phrase, so "a b" => a b "a b"~10
q8: Like q7 but AND


q9: Separate call to QueryParser
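
For anyone who wants the q5-q8 shapes spelled out, here is a small Java sketch that builds the query strings from a list of terms. It is not the code behind the page; the class and method names are made up, and it assumes the phrase clause in the q8 variant stays optional while the individual terms are required.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: builds query-syntax strings, it does not call Lucene.
final class QueryShapes {

    // q5-style simple OR: "a b" => a b
    static String or(List<String> terms) {
        return String.join(" ", terms);
    }

    // q6-style simple AND: "a b" => +a +b
    static String and(List<String> terms) {
        return terms.stream()
                .map(t -> "+" + t)
                .collect(Collectors.joining(" "));
    }

    // q7/q8-style: append a sloppy phrase over all terms,
    // e.g. a b => a b "a b"~10
    static String withPhrase(String base, List<String> terms, int slop) {
        return base + " \"" + String.join(" ", terms) + "\"~" + slop;
    }

    public static void main(String[] args) {
        List<String> terms = java.util.Arrays.asList("a", "b");
        System.out.println(or(terms));
        System.out.println(and(terms));
        System.out.println(withPhrase(or(terms), terms, 10));
        System.out.println(withPhrase(and(terms), terms, 10));
    }
}
```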


I'm not convinced MultiFieldQueryParser works right with one field (but maybe that was the point of this thread? :) )
If you search for "blahblahblah java" one would expect the AND queries to return zero matches, as blahblahblah does not appear in the corpus:


http://www.searchmorph.com/kat/wikipedia-similarity1.jsp?s=blahblahblah+java&goal=10&tfLogBase=2.3026&idfLogBase=2.3026&phraseBoost=1.0000&slop=10&qp=blahblahblah+java

But q2, MultiFieldQueryParser/AND returns:
                +(blahblahblah java)
instead of
                +blahblahblah +java

The only AND code that works right when one of the terms doesn't match is, um, my humble code (q6/q8).
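
To spell out why the two forms differ, here is a toy Java sketch of the boolean semantics, assuming the usual rule that a group of purely optional clauses matches when at least one of them matches. It models a document as a bare set of terms and does not touch Lucene; all names are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: models matching, not scoring.
final class RequiredClauseDemo {

    // +(a b): one required group whose members are optional,
    // so any single matching term satisfies it.
    static boolean matchesRequiredGroup(Set<String> docTerms, List<String> queryTerms) {
        for (String t : queryTerms) {
            if (docTerms.contains(t)) {
                return true;
            }
        }
        return false;
    }

    // +a +b: every term is required on its own.
    static boolean matchesAllRequired(Set<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "programming", "language"));
        List<String> query = Arrays.asList("blahblahblah", "java");
        System.out.println("+(blahblahblah java) matches: " + matchesRequiredGroup(doc, query));
        System.out.println("+blahblahblah +java matches:  " + matchesAllRequired(doc, query));
    }
}
```

So a document containing only "java" still satisfies the grouped form, which is why q2 returns hits for a term that is not in the corpus.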

So, does this make sense and is it a useful way of trying to evaluate the Similarities?

I think another thread w/ a different subject has started on this topic, I'll try to redirect it back here.

thx,
 Dave






My $0.02,

Chuck

> -----Original Message-----
> From: David Spencer [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 10:59 AM
> To: Lucene Developers List
> Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
> evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
> problems with Similarity.docFreq() ?
>
> Doug Cutting wrote:
> > David Spencer wrote:
> >>
> >> +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
> >>
> >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
> >>
> >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5
> >
> > This looks great to me! I'd make mand=true by default, i.e., have a
> > method where this parameter is not specified. Similarly, we might
> > default phraseBoosts[i] to boolBoosts[i]*phraseBoost, and slops to
> > infinity. What we want is something that provides only the knobs that
> > we think most folks will need. Ideally we wouldn't even need to specify
> > fieldBoosts. Short fields like titles get a larger lengthNorm, which
> > effectively boosts them a lot already.
>
> Yeah I agree w/ all of the above, offer options but have easy to use
> ways of calling it w/ intelligent defaults.
>
> > But perhaps we should back off and first just evaluate single field
> > search with different idf, tf (and perhaps lengthNorm and sloppyFreq)
> > definitions. Once we're happy with those, then we should return to
> > different multi-field query formulations.
> >
> > Let's start with the issue that's been raised so much: whether idf is
> > better defined with log() or sqrt(log()).
>
> I can redo my page and rebuild indexes if necessary, I just need it
> clarified what we want to do, esp -> does the index need to be rebuilt?
>
> [1]
> I currently have 2 variations on the index, one w/ the default settings
> and another with the Similarity code Chuck attached to the bug report.
> Do we need other variations on the index e.g. with different weights, or
> during indexing are the weights less important than the log() vs.
> sqrt(log()) issue?
>
> [2]
> I guess it's obvious from the above, but just to make it clear - I'll
> change the page to only do single field queries - but how many
> variations do we want to see in parallel - the current page shows 2x2
> results, for each combo of index and query - but I, say, show several
> more queries in parallel w/ different weights...
>
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]