David,

Thanks for taking the lead on this!

You have two fields for this collection, title and body, right?

I'd like to configure this to use my DistributingMultiFieldQueryParser,
MaxDisjunctionQuery (and MaxDisjunctionScorer), and Similarity.
DistributingMultiFieldQueryParser has a simple API -- it generates the
uses of MaxDisjunctionQuery.  So these should be easy to integrate.
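
For readers new to the idea, here is a minimal sketch in plain Java of max-style scoring across fields. It is not the code posted in Bugzilla. It assumes the score for a term is the best per-field score plus a small tie-breaker fraction of the other field scores, instead of the plain sum. The class name, method names, tie-breaker value, and sample scores are all hypothetical.

```java
public class MaxDisjunctionSketch {

    /** Baseline combination: add up the per-field scores. */
    public static double sumScore(double[] fieldScores) {
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
        }
        return sum;
    }

    /**
     * Max-style combination (assumed behavior): the best field score,
     * plus tieBreaker times the sum of the remaining field scores.
     * Scores are assumed to be non-negative.
     */
    public static double maxDisjunctionScore(double[] fieldScores, double tieBreaker) {
        double sum = 0.0;
        double max = 0.0;
        for (double s : fieldScores) {
            sum += s;
            if (s > max) {
                max = s;
            }
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Made-up per-field scores in the order {title, body}.
        double[] titleOnly = {0.9, 0.0};  // strong match in title only
        double[] bothWeak = {0.5, 0.5};   // weak match in both fields
        double tieBreaker = 0.1;          // hypothetical value

        System.out.println("sum: titleOnly=" + sumScore(titleOnly)
                + " bothWeak=" + sumScore(bothWeak));
        System.out.println("max: titleOnly=" + maxDisjunctionScore(titleOnly, tieBreaker)
                + " bothWeak=" + maxDisjunctionScore(bothWeak, tieBreaker));
    }
}
```

With these made-up numbers, the plain sum ranks the document with two weak matches first, while the max-style score ranks the strong title match first.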

What version of Java are you using?  My version of MaxDisjunctionQuery
requires Java 1.5, although another user modified it to work with Java
1.4.  If you are using 1.4, please let me know so I can look over the
modified version.

All the classes except the Similarity are already posted
(http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).  I still need
to decide what field boosts to use for this collection with
DistributingMultiFieldQueryParser, and to think about tweaking the
Similarity (as my database is substantially different from
Wikipedia).

Side by side comparisons make sense to me.  It would be helpful if you
could provide an explain button so we can compare the score components
and tune.

If this makes sense, I should be able to get the remaining pieces to you
by tomorrow or this weekend.

Chuck

  > -----Original Message-----
  > From: David Spencer [mailto:[EMAIL PROTECTED]
  > Sent: Thursday, January 27, 2005 2:36 PM
  > To: Lucene Developers List
  > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
  > Similarity.docFreq() ?
  > 
  > Doug Cutting wrote:
  > 
  > > Chuck Williams wrote:
  > >
  > >> Christoph Goller writes:
  > >>   > You may be right. But I am not completely convinced. I think
  > >>   > this should be decided based on the proposed benchmark evaluation.
  > >>
  > >> Is that still happening?
  > >
  > >
  > > Like anything else in an all-volunteer operation, it will only happen if
  > > folks volunteer to do it.  Someone needs to take the lead and index a
  > > reference collection with a couple of different Similarity
  > > implementations and post the code and the results of various searches
  > > for folks to evaluate.  Chuck?
  > 
  > In theory I can probably easily do this, esp if someone would submit
  > another Similarity implementation.
  > 
  > The corpus easiest for me to use is the subset of the English Wikipedia
  > I've been playing with. It has 400k documents. Let's see, max length of
  > a body is 258kb, avg len of non-trivial entries (size > 100 chars) is
  > 2450 chars, and std dev is 3400 chars. I'm using "wikipedia namespace 0",
  > which means the normal encyclopedia pages and not things like chat logs,
  > help pages, or whatnot.
  > 
  > I recently made a demo page of the MoreLikeThis similarity query
  > generator + related algorithms (confusion alert, 'similarity' means
  > "show me documents similar to another doc", is implemented on top of
  > Lucene, and is not the same as org.apache.lucene.search.Similarity...)
  > 
  > The page runs 3 algorithms in parallel and displays them on 1 page.
  > 
  > Here's a page that shows the 3 cols, 1 per alg:
  > 
  > http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Information_retrieval
  > 
  > You get there from a normal Wikipedia search by clicking "cmp" to the
  > right of one of the matching docs:
  > 
  > http://www.searchmorph.com/kat/wikipedia.jsp?s=information+retrieval
  > 
  > Oh, and the relevance to this thread is, I'm assuming this is what we
  > want for comparing the different Similarity implementations: an easy way
  > of seeing how they perform against a given query.
  > 
  > So, moving forward, if anyone agrees in general with me:
  > 
  > [1] Post some reasonable/interesting Similarity implementations
  > [2] Confirm that it makes sense to compare them on 1 screen "in
  > parallel"
  > 
  > thx,
  >   Dave
  > 
  > >
  > > Doug
  > >
  > >
  > > ---------------------------------------------------------------------
  > > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > > For additional commands, e-mail: [EMAIL PROTECTED]
  > >
  > 


