Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

David Spencer Thu, 27 Jan 2005 23:03:47 -0800

Chuck Williams wrote:

David,
Thanks for taking the lead on this!


You're welcome!

You have two fields for this collection, title and body, right?


Yes.


I'd like to configure this to use my DistributingMultiFieldQueryParser,
MaxDisjunctionQuery (and MaxDisjunctionScorer), and Similarity.
DistributingMultiFieldQueryParser has a simple API -- it generates the
uses of MaxDisjunctionQuery.  So these should be easy to integrate.

What version of Java are you using?  My version of MaxDisjunctionQuery
requires Java 1.5, although another user modified it to work with Java
1.4.  If you are using 1.4, please let me know so I can look over the
modified version.

I'm on JDK 1.4.2_06 and Tomcat 4+. Had issues w/ the Tomcat 5.5+/JDK 1.5 combo so I rolled back.


All the classes except the Similarity are already posted
(http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).  I need to
think about what field boosts to use for this collection with
DistributingMultiFieldQueryParser, and need to think about tweaking the
Similarity (as my database is substantially different from the
wikipedia).

Side by side comparisons make sense to me.  It would be helpful if you
could provide an explain button so we can compare the score components
and tune.

I was thinking the same thing.
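
As a rough illustration of the side-by-side "explain" idea, here is a minimal sketch in plain Java that does not use Lucene at all. The BASELINE factors follow the tf and idf formulas of Lucene's DefaultSimilarity (sqrt(freq) and ln(numDocs/(docFreq+1)) + 1); the VARIANT is purely hypothetical and is not anyone's proposed Similarity. The class and method names are made up for this example, and the tf*idf product is a simplification, not Lucene's full scoring formula.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExplainCompare {

    /** The two score factors compared here; a real Similarity has more. */
    interface Sim {
        double tf(double freq);
        double idf(int docFreq, int numDocs);
    }

    /** Mirrors the tf and idf formulas of Lucene's DefaultSimilarity. */
    static final Sim BASELINE = new Sim() {
        public double tf(double freq) {
            return Math.sqrt(freq);
        }
        public double idf(int docFreq, int numDocs) {
            return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
        }
    };

    /** Hypothetical variant, for illustration only: logarithmic tf, same idf. */
    static final Sim VARIANT = new Sim() {
        public double tf(double freq) {
            return freq > 0 ? 1.0 + Math.log(freq) : 0.0;
        }
        public double idf(int docFreq, int numDocs) {
            return BASELINE.idf(docFreq, numDocs);
        }
    };

    /** Returns the named score components for one term in one document. */
    static Map<String, Double> components(Sim sim, double freq, int docFreq, int numDocs) {
        Map<String, Double> parts = new LinkedHashMap<String, Double>();
        double tf = sim.tf(freq);
        double idf = sim.idf(docFreq, numDocs);
        parts.put("tf", tf);
        parts.put("idf", idf);
        parts.put("tf*idf", tf * idf); // simplified product, not the full Lucene score
        return parts;
    }

    /** Formats both sets of components as one table, one row per factor. */
    static String sideBySide(double freq, int docFreq, int numDocs) {
        Map<String, Double> a = components(BASELINE, freq, docFreq, numDocs);
        Map<String, Double> b = components(VARIANT, freq, docFreq, numDocs);
        StringBuilder out = new StringBuilder(
                String.format("%-8s %10s %10s%n", "factor", "baseline", "variant"));
        for (String key : a.keySet()) {
            out.append(String.format("%-8s %10.4f %10.4f%n", key, a.get(key), b.get(key)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document, in 9 of 10 documents overall.
        System.out.print(sideBySide(4.0, 9, 10));
    }
}
```

In the real page the numbers would presumably come from the searcher's explain output rather than from hand-written formulas; the point of the sketch is only the layout, one row per score component and one column per implementation.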

If this makes sense, I should be able to get the remaining pieces to you
by tomorrow or this weekend.

OK - I'll try to catch up on the meaning of DistributingMultiFieldQueryParser et al., but what's easiest for me is for you to tell me how to turn the String search argument into a Query.

So I'll have a page named something like wikipedia-sim.jsp. It'll be invoked as "wikipedia-sim.jsp?s=big+dog" and the impl will have to turn "big dog" into a Query.
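
To make the first half of that step concrete, here is a minimal sketch of getting from the raw query string "s=big+dog" to the text "big dog". The SearchParam class and its extract method are hypothetical names for this example. In an actual JSP the container already does this decoding via request.getParameter("s"), so the sketch only shows what the page starts from; handing the decoded text to a query parser to build the Query is not shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class SearchParam {

    /**
     * Returns the decoded value of the named parameter from a raw
     * query string such as "s=big+dog", or null if it is absent.
     */
    static String extract(String rawQuery, String name) {
        if (rawQuery == null) {
            return null;
        }
        String[] pairs = rawQuery.split("&");
        for (int i = 0; i < pairs.length; i++) {
            String pair = pairs[i];
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                try {
                    // '+' becomes a space and %XX escapes are decoded.
                    return URLDecoder.decode(value, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is always available on a compliant JVM.
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract("s=big+dog", "s"));
    }
}
```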

The baseline will presumably use the default Lucene Similarity and Query Parser.

For "your case" you'll need to tell me if I need to call DistributingMultiFieldQueryParser or whatnot.

Also, dumb question...do I need to build an index for every impl of Similarity? Thus there will be "n" indexes and wikipedia-sim.jsp will search in each one with the corresponding Similarity?

thx,
 Dave

Chuck
> -----Original Message-----
> From: David Spencer [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 27, 2005 2:36 PM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
> Similarity.docFreq() ?
>
> Doug Cutting wrote:
>
> > Chuck Williams wrote:
> >
> >> Christoph Goller writes:
> >> > You may be right. But I am not completely convinced. I think
> >> > this should be decided based on the proposed benchmark evaluation.
> >>
> >> Is that still happening?
> >
> >
> > Like anything else in an all-volunteer operation, it will only happen if
> > folks volunteer to do it. Someone needs to take the lead and index a
> > reference collection with a couple of different Similarity
> > implementations and post the code and the results of various searches
> > for folks to evaluate. Chuck?
>
> In theory I can probably easily do this, esp if someone would submit
> another Similarity implementation.
>
> The corpus easiest for me to use is the subset of the English Wikipedia
> I've been playing with. It has 400k documents..let's see, max length of
> a body is 258kb, avg len of non-trivial entries (size > 100 chars) is
> 2450 chars, and std dev is 3400 chars. I'm using "wikipedia namespace 0"
> which means the normal encyclopedia pages and not things like chatlogs,
> help pages, or whatnot.
>
> I recently made a demo page of the MoreLikeThis similarity query
> generator + related algorithms (confusion alert, 'similarity' means
> "show me documents similar to another doc", is implemented on top of
> Lucene, and is not the same as org.apache.lucene.search.Similarity...)
>
> The page runs 3 algorithms in parallel and displays them on 1 page.
>
> Here's a page that shows the 3 cols, 1 per alg:
>
> http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Information_retrieval
>
> And you get there from a normal wikipedia search, click on "cmp" on the
> right of one of the matching docs:
>
> http://www.searchmorph.com/kat/wikipedia.jsp?s=information+retrieval
>
> Oh, and the relevance to this thread is, I'm assuming this is what we
> want to compare the different Similarity implementations, an easy way of
> seeing how they perform against a given query.
>
> So, moving forward, if anyone agrees in general with me:
>
> [1] Post some reasonable/interesting Similarity implementations
> [2] Confirm that it makes sense to compare them on 1 screen "in
> parallel"
>
> thx,
> Dave
>
> >
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
