RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Mon, 31 Jan 2005 12:47:19 -0800

David, thanks for all your work on this!

I'm curious to see what others think.  Evaluating this properly would
require a reasonably systematic analysis of a significant number of
queries.  Of the 4 test queries pre-selected on the page, I think the
new approach clearly performs better on "chess champion" and "russian
politics", and arguably performs better on "information retrieval search
engine".  "conspiracy theory" is basically a wash, and fairly
uninteresting since the terms appear widely in both titles and content.


"information retrieval search engine" is an interesting example.  First,
we should forget that we as people understand a close relationship
between "information retrieval" and "search engine", since neither
approach is using an ontology, nor even any correlation statistics
(e.g., LSI).  Then, look at the score for the SMART system result in the
current code.  The article does not mention "search" or "engine" and yet
it gets a very high score because "information retrieval" occurs in both
title and content.  This is a great example of the current approach's
failure to assess term diversity (i.e., coverage of distinct terms in
the query).
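
To make the term-diversity point concrete, here is a toy sketch in
Python.  It is neither Lucene's scoring formula nor the proposed
Similarity; the function names, the two sample documents, and the choice
to multiply by coverage are all assumptions made purely for
illustration.  It only shows how a score that adds up per-field matches
can favor a document covering half the query terms, and how weighting by
the fraction of distinct query terms matched reverses that ordering.

```python
# Toy illustration only: NOT Lucene's formula and NOT the proposed
# Similarity. All names and sample documents here are hypothetical.

def tokens(text):
    """Lowercase whitespace tokenizer (no stemming, no stop words)."""
    return text.lower().split()

def additive_score(query, fields):
    """Count every occurrence of any query term, summed over all fields."""
    terms = set(tokens(query))
    return sum(1 for text in fields.values()
               for tok in tokens(text) if tok in terms)

def term_coverage(query, fields):
    """Fraction of distinct query terms found in at least one field."""
    terms = set(tokens(query))
    if not terms:
        return 0.0
    seen = set()
    for text in fields.values():
        seen.update(tokens(text))
    return len(terms & seen) / len(terms)

def coverage_weighted_score(query, fields):
    """Additive score scaled down when query terms are missing."""
    return additive_score(query, fields) * term_coverage(query, fields)

QUERY = "information retrieval search engine"

# Hypothetical document that repeats two of the four query terms.
DOC_A = {
    "title": "smart information retrieval system",
    "content": ("information retrieval system for information retrieval "
                "research on information retrieval"),
}

# Hypothetical document that mentions all four query terms once or twice.
DOC_B = {
    "title": "search engine",
    "content": "a search engine is an information retrieval system",
}

for name, doc in (("A", DOC_A), ("B", DOC_B)):
    print(name,
          additive_score(QUERY, doc),
          term_coverage(QUERY, doc),
          coverage_weighted_score(QUERY, doc))
```

In this sketch document A wins on the purely additive count even though
it never mentions "search" or "engine", while document B wins once
coverage is taken into account.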

Chuck

  > -----Original Message-----
  > From: David Spencer [mailto:[EMAIL PROTECTED]
  > Sent: Monday, January 31, 2005 11:35 AM
  > To: Lucene Developers List
  > Subject: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
  > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
  > problems with Similarity.docFreq() ?
  > 
  > 
  > I worked w/ Chuck to get up a test page that shows search results
  > with 2 versions of Similarity side by side.
  > 
  > URL here:
  > 
  >     http://www.searchmorph.com/kat/wikipedia-similarity.jsp
  > 
  > Weblog entry here w/ some more details:
  > 
  >     http://www.searchmorph.com/weblog/index.php?id=46
  > 
  > 
  > But briefly the page uses 2 indexes of the wikipedia.
  > First index is all default Lucene code, and ditto for the query
  > parser.
  > 
  > The second index uses Chuck's suggestion for another similarity
  > implementation, and the search results use this same similarity +
  > the query parser (DistributingMultiFieldQueryParser) he has proposed.
  > 
  > The page lets you tune parameters to his Similarity impl so you can
  > see the effect of different weights.
  > 
  > One test that seems to show how the new code performs better is the
  > search for "russian politics" where the results on the right seem
  > more relevant:
  > 
  > http://www.searchmorph.com/kat/wikipedia-similarity.jsp?s=russian+politics
  > 
  > 
  > 
  > 
  > 
  > 
  > 
  > Chuck Williams wrote:
  > 
  > > Dave, are you using MultiFieldQueryParser and DefaultSimilarity for
  > > the vanilla implementation?
  > >
  > > It's important to know what we are comparing...
  > >
  > > Chuck
  > >
  > >   > -----Original Message-----
  > >   > From: David Spencer [mailto:[EMAIL PROTECTED]
  > >   > Sent: Friday, January 28, 2005 3:38 PM
  > >   > To: Lucene Developers List
  > >   > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed
  > >   > with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
  > >   >
  > >   > Daniel Naber wrote:
  > >   >
  > >   > > On Friday 28 January 2005 22:45, Chuck Williams wrote:
  > >   > >
  > >   > >
  > >   > >>The fact that it requires all terms in all
  > >   > >>fields is part of the problem.  Once that is addressed, another
  > >   > >>problem is that Lucene does not provide a good mechanis
  > >   > >
  > >   > >
  > >   > > That's fixed in CVS, so maybe the CVS version should be used for
  > >   > > the evaluation. I think it should be robust.
  > >   >
  > >   > Hmmm, is it safe to assume I can build the index w/ lucene-1.4.3.jar
  > >   > but deploy the webapp for searching w/ lucene-1.5-rc1-dev.jar?
  > >   >
  > >   > And is the current code supposed to build with so many deprecated
  > >   > warnings?
  > >   >
  > >   > - Dave
  > >   >
  > >   > >
  > >   > > Regards
  > >   > >  Daniel
  > >   > >
  > >   >
  > >   >
  > >   >
  > >
---------------------------------------------------------------------
  > >   > To unsubscribe, e-mail:
[EMAIL PROTECTED]
  > >   > For additional commands, e-mail: lucene-dev-
  > [EMAIL PROTECTED]


