David, thanks for all your work on this! I'm curious to see what others think. To evaluate this properly would require a reasonably systematic analysis of a significant number of queries. On the 4 test queries pre-selected on the page, I think the new approach clearly performs better on "chess champion" and "russian politics", and arguably performs better on "information retrieval search engine". "conspiracy theory" is basically a wash, and fairly uninteresting since the terms appear widely in both titles and content.
"information retrieval search engine" is an interesting example. First, we should forget that we as people understand a close relationship between "information retrieval" and "search engine", since neither approach is using an ontology, nor even any correlation statistics (e.g., LSI). Then, look at the score for the SMART system result in the current code. The article does not mention "search" or "engine" and yet it gets a very high score because "information retrieval" occurs in both title and content. This is a great example of the current approach's failure to assess term diversity (i.e., coverage of distinct terms in the query).

Chuck

> -----Original Message-----
> From: David Spencer [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 31, 2005 11:35 AM
> To: Lucene Developers List
> Subject: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
> evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
> problems with Similarity.docFreq() ?
>
> I worked w/ Chuck to get up a test page that shows search results with 2
> versions of Similarity side by side.
>
> URL here:
>
> http://www.searchmorph.com/kat/wikipedia-similarity.jsp
>
> Weblog entry here w/ some more details:
>
> http://www.searchmorph.com/weblog/index.php?id=46
>
> But briefly the page uses 2 indexes of the wikipedia.
> First index is all default Lucene code, and ditto for the query parser.
>
> The second index uses Chuck's suggestion for another similarity
> implementation, and the search results use this same similarity + the
> query parser (DistributingMultiFieldQueryParser) he has proposed.
>
> The page lets you tune parameters to his Similarity impl so you can see
> the effect of different weights.
>
> One test that seems to show how the new code performs better is the
> search for "russian politics" where the results on the right seem more
> relevant:
>
> http://www.searchmorph.com/kat/wikipedia-similarity.jsp?s=russian+politics
>
> Chuck Williams wrote:
>
> > Dave, are you using MultiFieldQueryParser and DefaultSimilarity for the
> > vanilla implementation?
> >
> > It's important to know what we are comparing...
> >
> > Chuck
> >
> > > -----Original Message-----
> > > From: David Spencer [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, January 28, 2005 3:38 PM
> > > To: Lucene Developers List
> > > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed with
> > > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
> > >
> > > Daniel Naber wrote:
> > >
> > > > On Friday 28 January 2005 22:45, Chuck Williams wrote:
> > > >
> > > >>The fact that is requires all terms in all
> > > >>fields is part of the problem. Once that is addressed, another problem
> > > >>is that Lucene does not provide a good mechanis
> > > >
> > > > That's fixed in CVS, so maybe the CVS version should be used for the
> > > > evaluation. I think it should be robust.
> > >
> > > Hmmm, is it safe to assume I can build the index w/ lucene-1.4.3.jar but
> > > deploy the webapp for searching w/ lucene-1.5-rc1-dev.jar?
> > >
> > > And is the current code supposed to build with so many deprecated
> > > warnings?
> > >
> > > - Dave
> > >
> > > > Regards
> > > >  Daniel
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: lucene-dev-[EMAIL PROTECTED]
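
The term-diversity point in the reply at the top of this thread (a document that matches only "information retrieval" should not score as if it matched the whole query "information retrieval search engine") can be made concrete with a small sketch. This is only an illustration under stated assumptions: the class TermCoverage and its methods tokens and coverage are hypothetical names, the tokenizer is a toy rather than a Lucene Analyzer, the sample title and body strings are invented, and nothing here is Lucene's API or the Similarity implementation proposed in the thread. It simply computes the fraction of distinct query terms that appear in at least one field of a document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TermCoverage {

    // Toy tokenizer (an assumption, not a Lucene Analyzer):
    // lowercase the text and split on runs of non-alphanumeric characters.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Fraction of distinct query terms that occur in at least one field.
    // Repeating the same term in several fields does not raise the value.
    static double coverage(String query, List<String> fields) {
        Set<String> queryTerms = tokens(query);
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> docTerms = new HashSet<>();
        for (String field : fields) {
            docTerms.addAll(tokens(field));
        }
        int matched = 0;
        for (String q : queryTerms) {
            if (docTerms.contains(q)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    public static void main(String[] args) {
        String query = "information retrieval search engine";

        // Hypothetical title/body pairs, invented for this illustration.
        List<String> partialMatch = Arrays.asList(
                "SMART Information Retrieval System",
                "An information retrieval system for text.");
        List<String> fullMatch = Arrays.asList(
                "Search engine",
                "A search engine is an information retrieval system.");

        System.out.println(coverage(query, partialMatch)); // prints 0.5
        System.out.println(coverage(query, fullMatch));    // prints 1.0
    }
}
```

In this sketch the first document hits "information" and "retrieval" in both title and body yet covers only two of the four distinct query terms, which is the situation described for the SMART result. How such a coverage value would be weighted against per-field term scores is left open here.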