Chuck Williams wrote:
David,
Thanks for taking the lead on this!
You're welcome!
You have two fields for this collection, title and body, right?
Yes.
I'd like to configure this to use my DistributingMultiFieldQueryParser, MaxDisjunctionQuery (and MaxDisjunctionScorer), and Similarity. DistributingMultiFieldQueryParser has a simple API -- it generates the uses of MaxDisjunctionQuery. So these should be easy to integrate.
What version of Java are you using? My version of MaxDisjunctionQuery requires Java 1.5, although another user modified it to work with Java 1.4. If you are using 1.4, please let me know so I can look over the modified version.
I'm on JDK 1.4.2_06 and Tomcat 4+. Had issues w/ the Tomcat 5.5+/JDK 1.5 combo so I rolled back.
I was thinking the same thing.
All the classes except the Similarity are already posted (http://issues.apache.org/bugzilla/show_bug.cgi?id=32674). I need to think about what field boosts to use for this collection with DistributingMultiFieldQueryParser, and need to think about tweaking the Similarity (as my database is substantially different from the wikipedia).
Side by side comparisons make sense to me. It would be helpful if you could provide an explain button so we can compare the score components and tune.
OK - I'll try to catch up on the meaning of DistributingMultiFieldQueryParser et al., but what's easiest for me is for you to tell me how to turn the String search argument into a Query.
If this makes sense, I should be able to get the remaining pieces to you by tomorrow or this weekend.
So I'll have a page named something like wikipedia-sim.jsp.
It'll be invoked as "wikipedia-sim.jsp?s=big+dog" and the impl will have to turn "big dog" into a Query.
The baseline will presumably use the default Lucene Similarity and Query Parser.
For "your case" you'll need to tell me if I need to call DistributingMultiFieldQueryParser or whatnot.
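To make the "String into a Query" step concrete, here is a rough sketch in plain Java of one way the search argument could be spread across the title and body fields. It only builds a string in standard Lucene query syntax (field:term^boost); it is not DistributingMultiFieldQueryParser and does not use MaxDisjunctionQuery. The class name, the method names, the choice to make every term required, and the boost values are all assumptions made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration only; not part of Lucene or of the
// classes posted in Bugzilla.
class DistributedQuerySketch {

    // Decode a raw query-string value, e.g. "big+dog" -> "big dog".
    // (In a JSP, request.getParameter("s") normally returns the decoded value already.)
    static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Expand each whitespace-separated term across the title and body fields.
    // Assumption: every term is required (+) in at least one of the two fields.
    // Terms are not escaped, so Lucene special characters are passed through as-is.
    static String distribute(String search, float titleBoost, float bodyBoost) {
        StringBuilder sb = new StringBuilder();
        String[] terms = search.trim().split("\\s+");
        for (String term : terms) {
            if (term.isEmpty()) {
                continue; // happens when the search string is blank
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("+(title:").append(term).append('^').append(titleBoost)
              .append(" body:").append(term).append('^').append(bodyBoost)
              .append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Boosts of 2.0 and 1.0 are placeholders, not tuned values.
        System.out.println(distribute(decode("big+dog"), 2.0f, 1.0f));
    }
}
```

The resulting string could be handed to the stock query parser for the baseline page, while the real DistributingMultiFieldQueryParser would build its Query objects directly.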
Also, dumb question...do I need to build an index for every impl of Similarity? Thus there will be "n" indexes and wikipedia-sim.jsp will search in each one with the corresponding Similarity?
thx, Dave
Chuck
> -----Original Message-----
> From: David Spencer [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 27, 2005 2:36 PM
> To: Lucene Developers List
> Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
> > Doug Cutting wrote:
> > > Chuck Williams wrote:
> >
> >> Christoph Goller writes:
> >> > You may be right. But I am not completely convinced. I think
> >> > this should be decided based on the proposed benchmark evaluation.
> >>
> >> Is that still happening?
> >
> >
> > Like anything else in an all-volunteer operation, it will only happen if
> > folks volunteer to do it. Someone needs to take the lead and index a
> > reference collection with a couple of different Similarity
> > implementations and post the code and the results of various searches
> > for folks to evaluate. Chuck?
>
> In theory I can probably easily do this, esp if someone would submit
> another Similarity implementation.
>
> The corpus easiest for me to use is the subset of the English Wikipedia
> I've been playing with. It has 400k documents... let's see, max length of
> a body is 258kb, avg len of non-trivial entries (size > 100 chars) is
> 2450 chars, and std dev is 3400 chars. I'm using "wikipedia namespace 0"
> which means the normal encyclopedia pages and not things like chatlogs,
> help pages, or whatnot.
>
> I recently made a demo page of the MoreLikeThis similarity query
> generator + related algorithms (confusion alert, 'similarity' means
> "show me documents similar to another doc", is implemented on top of
> Lucene, and is not the same as org.apache.lucene.search.Similarity...)
>
> The page runs 3 algorithms in parallel and displays them on 1 page.
>
> Here's a page that shows the 3 cols, 1 per alg:
>
> http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Information_retrieval
>
> And you get there from a normal wikipedia search, click on "cmp" on the
> right of one of the matching docs:
>
> http://www.searchmorph.com/kat/wikipedia.jsp?s=information+retrieval
>
> Oh, and the relevance to this thread is, I'm assuming this is what we
> want to compare the different Similarity implementations, an easy way of
> seeing how they perform against a given query.
>
> So, moving forward, if anyone agrees in general with me:
>
> [1] Post some reasonable/interesting Similarity implementations
> [2] Confirm that it makes sense to compare them on 1 screen "in parallel"
>
> thx,
> Dave
> > Doug
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail:
[EMAIL PROTECTED]
> >