RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Fri, 28 Jan 2005 13:42:13 -0800

David,

I just posted WikipediaSimilarity to Bug 32674.  I've also reviewed and
tested the port to Java 1.4 -- it's fine (although all the casts remind
me why I like 1.5 so much).  Thanks to Miles Barr for this port!


You don't want any of the test classes.  You just need these 4 classes:

DistributingMultiFieldQueryParser
MaxDisjunctionQuery
MaxDisjunctionScorer
WikipediaSimilarity

WikipediaSimilarity can be placed in whatever package makes sense within
your app.  It assumes your fields are named "title" and "body" (only
"body" is explicitly referenced).  If the names are different, you can
change the single reference to "body" in lengthNorm().  I'm assuming
title is just the title, and body is just the article without its title
(i.e., the two fields are disjoint).  If the semantics of the fields are
different, please let me know as this would imply a need to change
things more fundamentally.
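
For anyone adapting the field name, here is a minimal standalone sketch of
where that single "body" reference sits. It is not the class posted to the
bug: it has no Lucene dependency, the class name FieldLengthNormSketch is
hypothetical, and the formulas are placeholders (the body branch simply
reuses the stock 1/sqrt(numTokens) normalization). Only the method shape
follows Lucene's Similarity.lengthNorm(String fieldName, int numTokens).

```java
// Hypothetical illustration only; NOT the WikipediaSimilarity from the bug.
class FieldLengthNormSketch {
    // Assumed field name. Change this one constant if your index names it differently.
    static final String BODY_FIELD = "body";

    // Same parameter shape as Lucene's Similarity.lengthNorm(fieldName, numTokens).
    static float lengthNorm(String fieldName, int numTokens) {
        if (BODY_FIELD.equals(fieldName)) {
            // Placeholder formula: stock 1/sqrt(n) length normalization.
            return (float) (1.0 / Math.sqrt(Math.max(numTokens, 1)));
        }
        // Placeholder for every other field (e.g. "title"): no length penalty.
        return 1.0f;
    }

    public static void main(String[] args) {
        System.out.println("body,  400 tokens: " + lengthNorm("body", 400));
        System.out.println("title,   4 tokens: " + lengthNorm("title", 4));
    }
}
```

Because the field name is compared in exactly one place, renaming the body
field is a one-line change; changing what the fields mean is not.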

The Wikipedia attachment suggests the same initial values to try for
DEFAULT_BOOSTS as in my previous message.
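
As a rough sketch of the 1:1 pairing between fields and boosts, using the
initial values {3.0f, 1.0f} from the message quoted below: the class
BoostPairingSketch and its expand() helper are hypothetical and only render
the pairing in ordinary Lucene query syntax (field:term^boost). The real
DistributingMultiFieldQueryParser builds query objects rather than strings,
so treat this purely as an illustration of the correspondence.

```java
// Hypothetical helper; it only illustrates how fields and boosts line up.
class BoostPairingSketch {
    static final String[] DEFAULT_FIELDS = {"title", "body"};
    static final float[] DEFAULT_BOOSTS = {3.0f, 1.0f};

    // Expands one bare term into a boosted clause per field.
    static String expand(String term, String[] fields, float[] boosts) {
        if (fields.length != boosts.length) {
            throw new IllegalArgumentException(
                "fields and boosts must be in 1:1 correspondence");
        }
        StringBuffer sb = new StringBuffer(); // StringBuffer keeps this Java 1.4 friendly
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(fields[i]).append(':').append(term)
              .append('^').append(boosts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: title:lucene^3.0 body:lucene^1.0
        System.out.println(expand("lucene", DEFAULT_FIELDS, DEFAULT_BOOSTS));
    }
}
```

Tuning then amounts to editing the numbers in DEFAULT_BOOSTS while keeping
the two arrays the same length.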

I believe in tuning per application, and these classes will almost
certainly require tuning for Wikipedia, as my collection is very
different.

Please let me know when I can take a look at it.  It would be most
efficient if there is some way I could directly tune the parameters
(DEFAULT_BOOSTS and the formulas in WikipediaSimilarity).  Again, the
explain mechanism will be most helpful in doing this.

Thanks,

Chuck

  > -----Original Message-----
  > From: Chuck Williams [mailto:[EMAIL PROTECTED]
  > Sent: Friday, January 28, 2005 8:53 AM
  > To: Lucene Developers List
  > Subject: RE: Scoring benchmark evaluation. Was RE: How to proceed with
  > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
  > 
  > David Spencer wrote:
  > 
  >   > I'm on JDK 1.4.2_06 and Tomcat 4+. Had issues w/ the Tomcat
  >   > 5.5+/JDK 1.5 combo so I rolled back.
  > 
  > There have been issues with Tomcat 5.5, although supposedly the latest
  > version has them resolved.  I'm using Tomcat 5.0.28 with JDK 1.5.0_01,
  > which has been solid -- no problems at all.  But your combo should be
  > fine; I just want to verify Miles Barr's changes to remove the 1.5
  > dependencies from my classes.
  > 
  >   > The baseline will presumably use the default Lucene Similarity and
  >   > Query Parser.
  > 
  > I think the baseline should use Lucene's MultiFieldQueryParser to expand
  > the query to search both title and body fields, as this is presumably
  > the current "out-of-the-box" solution.  Similarly, it should use Lucene
  > 1.4.3, the current official release; is this what you are using?  There
  > may be a desire to use the CVS HEAD instead, which I have never run
  > with.
  > 
  >   > For "your case" you'll need to tell me if I need to call
  >   > DistributingMultiFieldQueryParser or whatnot.
  > 
  > Yes, you need something like this:
  > 
  >   private static final String[] DEFAULT_FIELDS = {"title", "body"};
  >   private static final float[] DEFAULT_BOOSTS = {3.0f, 1.0f};
  > 
  >   DistributingMultiFieldQueryParser.parse(
  >     queryString, DEFAULT_FIELDS, DEFAULT_BOOSTS,
  >     new StandardAnalyzer());
  > 
  > DEFAULT_FIELDS should contain whatever list of fields you are using
  > (that should be searched for simple query terms containing no explicit
  > field specs).  DEFAULT_BOOSTS must be in 1:1 correspondence with
  > DEFAULT_FIELDS.
  > 
  >   > Also, dumb question...do I need to build an index for every impl of
  >   > Similarity? Thus there will be "n" indexes and wikipedia-sim.jsp will
  >   > search in each one with the corresponding Similarity?
  > 
  > Yes, you will need a separate index for each Similarity, as some values
  > computed from the similarity are stored in the index.
  > 
  > I'll send you a Similarity and an initial value for DEFAULT_BOOSTS later
  > today or tomorrow.
  > 
  > Can you put up an explain mechanism to support tuning?  I'll want to
  > tune the DEFAULT_BOOSTS and various Similarity factors based on the
  > collection.
  > 
  > Thanks,
  > 
  > Chuck
  > 
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]
