David,

Thanks for taking the lead on this!

You have two fields for this collection, title and body, right? I'd like
to configure this to use my DistributingMultiFieldQueryParser,
MaxDisjunctionQuery (and MaxDisjunctionScorer), and Similarity.
DistributingMultiFieldQueryParser has a simple API and generates the
MaxDisjunctionQuery instances itself, so these should be easy to
integrate.

What version of Java are you using? My version of MaxDisjunctionQuery
requires Java 1.5, although another user modified it to work with Java
1.4. If you are using 1.4, please let me know so I can look over the
modified version.

All the classes except the Similarity are already posted
(http://issues.apache.org/bugzilla/show_bug.cgi?id=32674). I still need to
decide what field boosts to use for this collection with
DistributingMultiFieldQueryParser, and whether to tweak the Similarity
(my database is substantially different from Wikipedia).

Side-by-side comparisons make sense to me. It would be helpful if you
could provide an explain button so we can compare the score components
and tune.

If this makes sense, I should be able to get the remaining pieces to you
by tomorrow or this weekend.

Chuck

> -----Original Message-----
> From: David Spencer [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 27, 2005 2:36 PM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
>
> Doug Cutting wrote:
>
> > Chuck Williams wrote:
> >
> >> Christoph Goller writes:
> >> > You may be right. But I am not completely convinced. I think
> >> > this should be decided based on the proposed benchmark evaluation.
> >>
> >> Is that still happening?
> >
> > Like anything else in an all-volunteer operation, it will only happen if
> > folks volunteer to do it. Someone needs to take the lead and index a
> > reference collection with a couple of different Similarity
> > implementations and post the code and the results of various searches
> > for folks to evaluate.
> > Chuck?
>
> In theory I can probably easily do this, esp. if someone would submit
> another Similarity implementation.
>
> The corpus easiest for me to use is the subset of the English Wikipedia
> I've been playing with. It has 400k documents. Let's see: max length of
> a body is 258kb, avg len of non-trivial entries (size > 100 chars) is
> 2450 chars, and std dev is 3400 chars. I'm using "wikipedia namespace 0",
> which means the normal encyclopedia pages and not things like chatlogs,
> help pages, or whatnot.
>
> I recently made a demo page of the MoreLikeThis similarity query
> generator + related algorithms (confusion alert: 'similarity' here means
> "show me documents similar to another doc", is implemented on top of
> Lucene, and is not the same as org.apache.lucene.search.Similarity...)
>
> The page runs 3 algorithms in parallel and displays them on 1 page.
>
> Here's a page that shows the 3 cols, 1 per alg:
>
> http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Information_retrieval
>
> And you get there from a normal wikipedia search: click on "cmp" on the
> right of one of the matching docs:
>
> http://www.searchmorph.com/kat/wikipedia.jsp?s=information+retrieval
>
> Oh, and the relevance to this thread is that I'm assuming this is what we
> want for comparing the different Similarity implementations: an easy way
> of seeing how they perform against a given query.
>
> So, moving forward, if anyone agrees in general with me:
>
> [1] Post some reasonable/interesting Similarity implementations
> [2] Confirm that it makes sense to compare them on 1 screen "in parallel"
>
> thx,
> Dave
>
> > Doug
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]