David,

I just posted WikipediaSimilarity to Bug 32674. I've also reviewed and
tested the port to Java 1.4 -- it's fine (although all the casts remind
me why I like 1.5 so much). Thanks to Miles Barr for this port!

You don't want any of the test classes. You just need these 4 classes:

    DistributingMultiFieldQueryParser
    MaxDisjunctionQuery
    MaxDisjunctionScorer
    WikipediaSimilarity

WikipediaSimilarity can be placed in whatever package makes sense within
your app. It assumes your fields are named "title" and "body" (only
"body" is explicitly referenced). If the names are different, you can
change the single reference to "body" in lengthNorm().

I'm assuming title is just the title, and body is just the article
without its title (i.e., the two fields are disjoint). If the semantics
of the fields are different, please let me know, as this would imply a
need to change things more fundamentally.

The Wikipedia attachment suggests the same initial values to try for
DEFAULT_BOOSTS as in my previous message. I believe in tuning per
application, and these classes will almost certainly require tuning for
Wikipedia, as my collection is very different.

Please let me know when I can take a look at it. It would be most
efficient if there is some way I could directly tune the parameters
(DEFAULT_BOOSTS and the formulas in WikipediaSimilarity). Again, the
explain mechanism will be most helpful in doing this.

Thanks,

Chuck

> -----Original Message-----
> From: Chuck Williams [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 28, 2005 8:53 AM
> To: Lucene Developers List
> Subject: RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
>
> David Spencer wrote:
>
> > I'm on JDK 1.4.2_06 and Tomcat 4+. Had issues w/ the Tomcat 5.5+/JDK 1.5
> > combo so I rolled back.
>
> There have been issues with Tomcat 5.5, although supposedly the latest
> version has them resolved. I'm using Tomcat 5.0.28 with JDK 1.5.0_01,
> which has been solid -- no problems at all. But your combo should be
> fine; I just want to verify Miles Barr's changes to remove the 1.5
> dependencies from my classes.
>
> > The baseline will presumably use the default Lucene Similarity and Query
> > Parser.
>
> I think the baseline should use Lucene's MultiFieldQueryParser to expand
> the query to search both title and body fields, as this is presumably
> the current "out-of-the-box" solution. Similarly, it should use Lucene
> 1.4.3, the current official release; is this what you are using? There
> may be a desire to use the CVS HEAD instead, which I have never run
> with.
>
> > For "your case" you'll need to tell me if I need to call
> > DistributingMultiFieldQueryParser or whatnot.
>
> Yes, you need something like this:
>
>     private static final String[] DEFAULT_FIELDS = {"title", "body"};
>     private static final float[] DEFAULT_BOOSTS = {3.0f, 1.0f};
>
>     DistributingMultiFieldQueryParser.parse(
>         queryString, DEFAULT_FIELDS, DEFAULT_BOOSTS, new StandardAnalyzer());
>
> DEFAULT_FIELDS should contain whatever list of fields you are using
> (that should be searched for simple query terms containing no explicit
> field specs). DEFAULT_BOOSTS must be in 1:1 correspondence with
> DEFAULT_FIELDS.
>
> > Also, dumb question...do I need to build an index for every impl of
> > Similarity? Thus there will be "n" indexes and wikipedia-sim.jsp will
> > search in each one with the corresponding Similarity?
>
> Yes, you will need a separate index for each Similarity, as some values
> computed from the similarity are stored in the index.
>
> I'll send you a Similarity and an initial value for DEFAULT_BOOSTS later
> today or tomorrow.
>
> Can you put up an explain mechanism to support tuning? I'll want to
> tune the DEFAULT_BOOSTS and various Similarity factors based on the
> collection.
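
The 1:1 correspondence between DEFAULT_FIELDS and DEFAULT_BOOSTS mentioned
above is easy to break while tuning, so a small guard may help. The sketch
below is only an illustration: FieldBoosts and pair() are hypothetical names,
not part of the classes attached to Bug 32674, and the code does not touch
Lucene at all. It just pairs each field with its boost and rejects arrays of
different lengths.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not one of the Bug 32674 classes.
class FieldBoosts {
    // Same initial values as in the quoted message.
    static final String[] DEFAULT_FIELDS = {"title", "body"};
    static final float[] DEFAULT_BOOSTS = {3.0f, 1.0f};

    // Pairs fields[i] with boosts[i], keeping the field order.
    // Throws if the two arrays are not in 1:1 correspondence.
    static Map<String, Float> pair(String[] fields, float[] boosts) {
        if (fields.length != boosts.length) {
            throw new IllegalArgumentException(
                "expected one boost per field, got " + fields.length
                + " fields and " + boosts.length + " boosts");
        }
        Map<String, Float> byField = new LinkedHashMap<String, Float>();
        for (int i = 0; i < fields.length; i++) {
            byField.put(fields[i], Float.valueOf(boosts[i]));
        }
        return byField;
    }

    public static void main(String[] args) {
        // Prints {title=3.0, body=1.0}
        System.out.println(pair(DEFAULT_FIELDS, DEFAULT_BOOSTS));
    }
}
```

Running such a check once at startup, before the arrays are handed to the
parser, would surface a mismatched edit to either array immediately rather
than at query time.
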
>
> Thanks,
>
> Chuck
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]