Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-22 Thread Doug Cutting
Wolf Siberski wrote: The price is an extension (or modification) of the Searchable interface. I've added corresponding search(Weight...) methods to the existing search(Query...) methods and deprecated the latter. I think this is the right solution. If Searchable is meant to be Lucene internal, then

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-22 Thread Wolf Siberski
Doug Cutting wrote: Wolf Siberski wrote: Now I found another solution which requires more changes, but IMHO is much cleaner: - when a query computes its Weight, it caches it in an attribute - a query can be 'frozen'. A frozen query always returns the cached Weight when calling Query.weight(). Or
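The caching idea quoted above (a query caches its Weight, and a frozen query always returns the cached one) can be sketched roughly as follows. This is only an illustration: FrozenQuerySketch, CachingQuery, WeightFactory and the stand-in Weight interface are hypothetical names, not Lucene classes, and the snippet does not say what an unfrozen query should do on repeated calls, so recomputing is an assumption.

```java
/** Hypothetical sketch of a "freezable" query; none of these types are Lucene's. */
final class FrozenQuerySketch {

    /** Stand-in for a computed weight (assumption: contents do not matter here). */
    interface Weight {}

    /** Stand-in for whatever computes a weight, e.g. using collection statistics. */
    interface WeightFactory {
        Weight create();
    }

    static final class CachingQuery {
        private final WeightFactory factory;
        private Weight cached;   // last computed weight, kept in an attribute
        private boolean frozen;  // once true, the cached weight is always returned

        CachingQuery(WeightFactory factory) {
            this.factory = factory;
        }

        /** Marks the query as frozen. */
        void freeze() {
            frozen = true;
        }

        /** Returns the cached weight if frozen and available, else computes and caches one. */
        Weight weight() {
            if (frozen && cached != null) {
                return cached;
            }
            cached = factory.create();
            return cached;
        }
    }

    public static void main(String[] args) {
        CachingQuery q = new CachingQuery(() -> new Weight() {});
        Weight first = q.weight();
        q.freeze();
        // After freezing, the same instance comes back on every call.
        System.out.println(first == q.weight());
    }
}
```

A frozen query of this kind could be handed to another searcher without its weight being recomputed from that searcher's local statistics, which appears to be the point of the proposal.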

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-21 Thread Doug Cutting
Wolf Siberski wrote: Now I found another solution which requires more changes, but IMHO is much cleaner: - when a query computes its Weight, it caches it in an attribute - a query can be 'frozen'. A frozen query always returns the cached Weight when calling Query.weight(). Originally there was no

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-18 Thread Wolf Siberski
Doug Cutting wrote: Christoph Goller wrote: The similarity specified for the search has to be modified so that both idf(...) AND queryNorm(...) always return 1 and as you say everything except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts of the rewritten query. coord/tf/slopp

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-10 Thread Wolf Siberski
Christoph Goller wrote: > Chuck Williams wrote: >> score(query, doc) = >> coord*queryNorm* >> sum[ term in query : >> idf(term)*boost(term)*idf(term)*tf(term, doc)*docNorm(doc) >>] >> >> where queryNorm = 1/sum[ term in query : (boost(term)*idf(term))^2 ] >> [...] The MultiSearcher bo

RE: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-07 Thread Chuck Williams
TECTED] > Sent: Monday, February 07, 2005 3:36 PM > To: Lucene Developers List > Subject: Re: single field code ready - Re: URL to compare 2 Similarity's > ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher proble

Re: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-07 Thread David Spencer
Daniel Naber wrote: On Tuesday 08 February 2005 00:06, David Spencer wrote: So, does this make sense and is it useful way of trying to evaluate the Similarities? Is this the MultiFieldQueryParser from Lucene 1.4? I see WEB-INF/lib/lucene-1.5-rc1-dev.jar dated Jan 28, though I'm not sure if that

Re: single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-07 Thread Daniel Naber
On Tuesday 08 February 2005 00:06, David Spencer wrote: > So, does this make sense and is it useful way of trying to evaluate the > Similarities? Is this the MultiFieldQueryParser from Lucene 1.4? Then it's "buggy" anyway, so it probably doesn't make sense to test it. But even with the current

single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-07 Thread David Spencer
e (q6/q8). So, does this make sense and is it useful way of trying to evaluate the Similarities? I think another thread w/ a different thread has started on this topic, I'll try to redirect it back here. thx, Dave My $0.02, Chuck > -Original Message----- > From: David Spencer [

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-02 Thread Chuck Williams
Paul Elschot wrote: > On Wednesday 02 February 2005 03:38, Chuck Williams wrote: > > I was hoping to do this > > by simple thresholding, e.g. achieve a property like "results with all > > terms matched are always in [0.8, 1.0], and results missing a term > > always have a score less than

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-02 Thread Paul Elschot
On Wednesday 02 February 2005 03:38, Chuck Williams wrote: > Paul Elschot wrote: > > An alternative is to make sure all scores are bounded. > > Then the coordination factor can be implemented in the same bound > > while preserving the coordination order. > > If I understand this, I think mor

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Chuck Williams
akarta.apache.org > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher > problems with Similarity.docFreq() ? > > Doug, > > On Tuesday 01 February 2005 20:05, Doug Cut

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Paul Elschot
Doug, On Tuesday 01 February 2005 20:05, Doug Cutting wrote: > Chuck Williams wrote: > > > So I think this can be implemented using the expansion I proposed > > > yesterday for MultiFieldQueryParser, plus something like my > > > DensityPhraseQuery and perhaps a few Similarity tweaks. > > >

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Chuck Williams
AM > To: Lucene Developers List > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher > problems with Similarity.docFreq() ? > > Chuck Williams wrote: > > > So I

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
David Spencer wrote: Let's start with the issue that's been raised so much: whether idf is better defined with log() or sqrt(log()). I can redo my page and rebuild indexes if necessary, I just need it clarified what we want to do, esp -> does the index need to be rebuilt? The index needs to be r

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Chuck Williams
'm proposing for Default-OR) should be separate. My $0.02, Chuck > -Original Message- > From: David Spencer [mailto:[EMAIL PROTECTED] > Sent: Tuesday, February 01, 2005 10:59 AM > To: Lucene Developers List > Subject: Re: URL to compare 2 Similarity's re

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
Chuck Williams wrote: > So I think this can be implemented using the expansion I proposed > yesterday for MultiFieldQueryParser, plus something like my > DensityPhraseQuery and perhaps a few Similarity tweaks. I don't think that works unless the mechanism is limited to default-AND (i.e., all

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5 (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5 (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5 This loo

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Chuck Williams
Doug Cutting wrote: > That's a lot of functionality bundled into a single Query class! I'd > rather make it possible to assemble this from reusable parts. And it > almost can be already. Then we can offer such a thing pre-packaged. That would be great, if it could be done. > So let me t

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
David Spencer wrote: +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5 (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5 (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5 This looks great to me! I'd

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
Chuck Williams wrote: Doug Cutting wrote: > What did you think of my DensityPhraseQuery proposal? It is a step in the direction of what I have in mind, but I'd like to go further. How about a query class with these properties: 1. Inputs are: a. F = list of fields b. B = list of

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side. David, This looks great! Thanks for doing this. Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chuck Williams
cene Developers List > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher > problems with Similarity.docFreq() ? > > Chuck Williams wrote: > > That expansion is scalable, but

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Chuck Williams wrote: That expansion is scalable, but it only accounts for proximity of all query terms together. E.g., it does not favor a match where t1 and t2 are close together while t3 is distant over a match where all 3 terms are distant. Worse, it would not favor a match with t1 and t2 in

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: But what is right if there are > 2 terms in terms of the phrases - does it have a phrase for every pair of terms like this (ignore fields and boosts and proximity for a sec): search for "t1 t2 t3" gives you these phrases in addition to the direct field m

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
David Spencer wrote: But what is right if there are > 2 terms in terms of the phrases - does it have a phrase for every pair of terms like this (ignore fields and boosts and proximity for a sec): search for "t1 t2 t3" gives you these phrases in addition to the direct field matches: "t1 t2" "t2

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chris Lamprecht
> frequently do not include all query terms. I just tried this bizarre > query: > hilbert space frank zappa george bush john kerry > > There are two hits and they do not appear to have all terms (even in the It could be that the anchor text pointing to these pages from some other web page had t

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side. David, This looks great! Thanks for doing this. Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Andrzej Bialecki
Folks, In the light of this discussion, I'm working slowly on a new release of Luke, which will include a BeanShell-driven Similarity designer. However, this particular module is not finished yet... given my current workload, this will take a week or two more... -- Best regards, Andrzej Bialeck

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chuck Williams
n error accessing the cache on the first hit). This included looking at the page source. Chuck > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Monday, January 31, 2005 1:44 PM > To: Lucene Developers List > Subject: Re: URL to compar

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side. David, This looks great! Thanks for doing this. Thank you...it involved lots of back & forth interactions w/ Chuck over the few days to get it t

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Chuck Williams wrote: I think the differences are pretty clear as the systems stands. Notice a substantial difference in the idf's in the respective explanations. I continue to think the current mechanism weights these too high, primarily due to its squaring. The other big difference occurs when

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chuck Williams
Doug Cutting wrote: > Is the default operator AND or OR? It appears to be OR, but it should > probably be AND. That's become the industry standard since QueryParser > was first written. Also, any chance we can get explanations for hits? Explanations are available. Click the score link on

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Doug Cutting wrote: It would translate a query "t1 t2" given fields f1 and f2 into something like: +(f1:t1^b1 f2:t1^b2) +(f2:t1^b1 f2:t2^b2) Oops. The first term on that line should be "f1:t2", not "f2:t1": +(f1:t2^b1 f2:t2^b2) f1:"t1 t2"~s1^b3 f2:"t1 t2"~s2^b4 Doug -
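The corrected expansion can be illustrated with a small string builder. This is a sketch only: MultiFieldExpansionSketch and expand() are hypothetical names, it produces query-parser style text rather than Lucene Query objects, and it assumes one term boost, one phrase slop and one phrase boost per field, as in the example above.

```java
/** Hypothetical helper that prints the multi-field expansion; not part of Lucene. */
final class MultiFieldExpansionSketch {

    /**
     * Builds "+(f1:t^b f2:t^b) ..." for every term, followed by one optional
     * sloppy phrase clause per field.
     */
    static String expand(String[] terms, String[] fields, float[] termBoosts,
                         int[] slops, float[] phraseBoosts) {
        StringBuilder sb = new StringBuilder();
        // One required clause per term: the term must match in at least one field.
        for (String term : terms) {
            sb.append("+(");
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) {
                    sb.append(' ');
                }
                sb.append(fields[f]).append(':').append(term)
                  .append('^').append(termBoosts[f]);
            }
            sb.append(") ");
        }
        // One optional phrase clause per field, rewarding terms that occur close together.
        String phrase = String.join(" ", terms);
        for (int f = 0; f < fields.length; f++) {
            if (f > 0) {
                sb.append(' ');
            }
            sb.append(fields[f]).append(":\"").append(phrase).append("\"~")
              .append(slops[f]).append('^').append(phraseBoosts[f]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(
                new String[] {"t1", "t2"},
                new String[] {"f1", "f2"},
                new float[] {2.0f, 1.0f},   // b1, b2 (illustrative values)
                new int[] {5, 2},           // s1, s2 (illustrative values)
                new float[] {3.0f, 1.5f})); // b3, b4 (illustrative values)
    }
}
```

The required "+(...)" clauses give default-AND behaviour across fields, while the phrase clauses only add to the score when they match.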

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
David Spencer wrote: I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side. David, This looks great! Thanks for doing this. Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's become the industry s

RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Chuck Williams
larity for > the > > vanilla implementation? > > > > It's important to know what we are comparing... > > > > Chuck > > > > > -Original Message- > > > From: David Spencer [mailto:[EMAIL PROTECTED] > > >

URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread David Spencer
vid Spencer [mailto:[EMAIL PROTECTED] > Sent: Friday, January 28, 2005 3:38 PM > To: Lucene Developers List > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? > > Daniel Naber wrote: >

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Miles Barr
> On Fri, 2005-01-28 at 21:42 +, Chuck Williams wrote: > I just posted WikipediaSimilarity to Bug 32674. I've also reviewed and > tested the port to Java 1.4 -- it's fine (although all the casts remind > me why I like 1.5 so much). Thanks to Miles Barr for this port! Not a problem, cheers fo

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-29 Thread Daniel Naber
On Saturday 29 January 2005 00:37, David Spencer wrote: > Hmmm, is it safe to assume I can build the index w/ lucene-1.4.3.jar but >deploy the webapp for searching w/ lucene-1.5-rc1-dev.jar? Yes, everything else would be a bug. > And is the current code supposed to build with so many depreca

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread David Spencer
everything is spelled out. Chuck > -Original Message- > From: David Spencer [mailto:[EMAIL PROTECTED] > Sent: Friday, January 28, 2005 3:38 PM > To: Lucene Developers List > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - Mul

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
> To: Lucene Developers List > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? > > Daniel Naber wrote: > > > On Friday 28 January 2005 22:45, Chuck Williams wrote: > > &

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread David Spencer
Daniel Naber wrote: On Friday 28 January 2005 22:45, Chuck Williams wrote: The fact that it requires all terms in all fields is part of the problem. Once that is addressed, another problem is that Lucene does not provide a good mechanis That's fixed in CVS, so maybe the CVS version should be use

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
TED] > Sent: Friday, January 28, 2005 3:21 PM > To: Lucene Developers List > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? > > On Friday 28 January 2005 22:45, Chuck Williams wrot

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Daniel Naber
On Friday 28 January 2005 22:45, Chuck Williams wrote: > The fact that it requires all terms in all > fields is part of the problem. Once that is addressed, another problem > is that Lucene does not provide a good mechanis That's fixed in CVS, so maybe the CVS version should be used for the eva

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
r [mailto:[EMAIL PROTECTED] > Sent: Friday, January 28, 2005 1:44 PM > To: Lucene Developers List > Subject: Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? > > On Friday 28 January 2005 17:53,

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
ct: Re: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher problems with Similarity.docFreq() ? > > On Friday 28 January 2005 17:53, Chuck Williams wrote: > > > I think the baseline should use Lucene's MultiFieldQueryParser to &

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Daniel Naber
On Friday 28 January 2005 17:53, Chuck Williams wrote: > I think the baseline should use Lucene's MultiFieldQueryParser to expand > the query to search both title and body fields, as this is presumably > the current "out-of-the-box" solution. Please remember that this is kind of buggy in Lucene 1

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
uck > -Original Message- > From: Chuck Williams [mailto:[EMAIL PROTECTED] > Sent: Friday, January 28, 2005 8:53 AM > To: Lucene Developers List > Subject: RE: Scoring benchmark evaluation. Was RE: How to proceed with > Bug 31841 - MultiSearcher problems wit

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
ery vector factor that determines the normalization and the idf should remain in the normalization. Chuck > -Original Message- > From: Christoph Goller [mailto:[EMAIL PROTECTED] > Sent: Friday, January 28, 2005 1:29 AM > To: Lucene Developers List > Subject: R

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Doug Cutting
Christoph Goller wrote: The similarity specified for the search has to be modified so that both idf(...) AND queryNorm(...) always return 1 and as you say everything except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts of the rewritten query. coord/tf/sloppyFreq computation wo

RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Chuck Williams
David Spencer wrote: > I'm on JDK 1.4.2_06 and Tomcat 4+. Had issues w/ the Tomcat 5.5+/JDK 1.5 > combo so I rolled back. There have been issues with Tomcat 5.5, although supposedly the latest version has them resolved. I'm using Tomcat 5.0.28 with JDK 1.5.0_01, which has been solid -- no

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Christoph Goller
Chuck Williams wrote: Actually, the normalize is a third idf factor (in a different form, square-rooted in the denominator and summed). I.e., for a simple BooleanQuery: score(query, doc) = coord*queryNorm* sum[ term in query : idf(term)*boost(term)*idf(term)*tf(term, doc)*docNorm(d
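To make the three idf factors concrete, here is a worked sketch of the formula as quoted. ScoreFormulaSketch is a hypothetical class, not Lucene code. The form of queryNorm used here, 1/sqrt(sum of (boost*idf)^2), is an assumption taken from the "square-rooted in the denominator and summed" description, and all input values are made up.

```java
/** Hypothetical worked example of the quoted scoring formula; not Lucene code. */
final class ScoreFormulaSketch {

    /** Assumed form: 1 / sqrt( sum over terms of (boost * idf)^2 ). */
    static double queryNorm(double[] boost, double[] idf) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < boost.length; i++) {
            double w = boost[i] * idf[i];
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    /** coord * queryNorm * sum[ idf * boost * idf * tf * docNorm ] over the query terms. */
    static double score(double coord, double[] boost, double[] idf,
                        double[] tf, double docNorm) {
        double sum = 0.0;
        for (int i = 0; i < boost.length; i++) {
            // idf appears twice here and a third time inside queryNorm.
            sum += idf[i] * boost[i] * idf[i] * tf[i] * docNorm;
        }
        return coord * queryNorm(boost, idf) * sum;
    }

    public static void main(String[] args) {
        double[] boost = {1.0, 1.0};
        double[] idf = {3.0, 4.0};
        double[] tf = {1.0, 2.0};
        System.out.println(score(1.0, boost, idf, tf, 0.5));
    }
}
```

With these inputs queryNorm is 1/sqrt(9 + 16) = 0.2 and the sum is 9*0.5 + 16*2*0.5 = 20.5, so the score comes out at about 4.1.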

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread David Spencer
ursday, January 27, 2005 2:36 PM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > Doug Cutting wrote: > > > Chuck Williams wrote: > > > >> Christoph

Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread David Spencer
be "n" indexes and wikipedia-sim.jsp will search in each one with the corresponding Similarity? thx, Dave Chuck > -Original Message- > From: David Spencer [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 27, 2005 2:36 PM > To: Lucene Developers List >

Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Chuck Williams
ow or this weekend. Chuck > -Original Message- > From: David Spencer [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 27, 2005 2:36 PM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.do

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Chuck Williams
: Thursday, January 27, 2005 11:08 AM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > Chuck Williams wrote: > > Christoph Goller writes: > > > You may be right. But I a

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread David Spencer
Doug Cutting wrote: Chuck Williams wrote: Christoph Goller writes: > You may be right. But I am not completely convinced. I think > this should be decided based on the proposed benchmark evaluation. Is that still happening? Like anything else in an all-volunteer operation, it will only happen

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Doug Cutting
Chuck Williams wrote: Christoph Goller writes: > You may be right. But I am not completely convinced. I think > this should be decided based on the proposed benchmark evaluation. Is that still happening? Like anything else in an all-volunteer operation, it will only happen if folks volunteer t

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Chuck Williams
Christoph Goller writes: > Chuck Williams wrote: > > Christoph Goller writes: > > > My intention was to (ab-)use query boosts for idf transmission and > to > > > overwrite Similarity so that local idf is ignored. The idea was to > > > simply multiply global idf into the given bo

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Christoph Goller
Chuck Williams wrote: Christoph Goller writes: > My intention was to (ab-)use query boosts for idf transmission and to > overwrite Similarity so that local idf is ignored. The idea was to > simply multiply global idf into the given boost. Unfortunately idf is > not only used with the boos

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Chuck Williams
ary 27, 2005 3:36 AM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > Wolf Siberski wrote: > > This is more or less how the patch I already submitted works > > (exc

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Christoph Goller
Wolf Siberski wrote: This is more or less how the patch I already submitted works (except that it ignored the query rewriting step). The problem I see with this now is that if I (ab-)use the Similarity class for idf transmission, it can't be redefined anymore by a user who wants to use a custom S

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Wolf Siberski
Christoph Goller wrote: [...] I also think this is the best way to fix this bug. However there may be a way to implement this while avoiding to change the Weight and Searchable API. The idea is to rewrite the query in MultiSearcher and while rewriting compile the global idf into the query boosts. F
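The idea of compiling a global idf into the query boosts can be sketched as below. GlobalIdfSketch and its methods are hypothetical, not Lucene API. The idf formula, ln(numDocs / (docFreq + 1)) + 1, is an assumption modeled on Lucene's default Similarity, and the per-index statistics are invented for illustration.

```java
/** Hypothetical sketch of folding a collection-wide idf into a term boost. */
final class GlobalIdfSketch {

    /** Assumed idf form: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Sums docFreq and maxDoc over all sub-indexes before applying idf once. */
    static double globalIdf(int[] docFreqPerIndex, int[] maxDocPerIndex) {
        int docFreq = 0;
        int numDocs = 0;
        for (int i = 0; i < docFreqPerIndex.length; i++) {
            docFreq += docFreqPerIndex[i];
            numDocs += maxDocPerIndex[i];
        }
        return idf(docFreq, numDocs);
    }

    /** Boost to place on the rewritten term so each sub-searcher can ignore its local idf. */
    static double boostWithGlobalIdf(double userBoost, int[] docFreqPerIndex,
                                     int[] maxDocPerIndex) {
        return userBoost * globalIdf(docFreqPerIndex, maxDocPerIndex);
    }

    public static void main(String[] args) {
        int[] docFreq = {1, 99};   // the term is rare in index 0, common in index 1
        int[] maxDoc = {100, 100};
        System.out.println("local idf, index 0: " + idf(docFreq[0], maxDoc[0]));
        System.out.println("local idf, index 1: " + idf(docFreq[1], maxDoc[1]));
        System.out.println("global idf:         " + globalIdf(docFreq, maxDoc));
    }
}
```

With purely local statistics the same term would be weighted very differently in the two indexes, which is why scores from the sub-searchers are not directly comparable unless a single global value is used.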

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-26 Thread Christoph Goller
Wolf Siberski wrote: Doug, Chuck, thanks for your feedback, proposals and explanations. The way to proceed seems quite clear to me now. Due to other obligations it will take probably about two to three weeks until I've implemented a new patch. I'll get back to you as soon as it's finished. --Wolf

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-17 Thread Wolf Siberski
Doug, Chuck, thanks for your feedback, proposals and explanations. The way to proceed seems quite clear to me now. Due to other obligations it will take probably about two to three weeks until I've implemented a new patch. I'll get back to you as soon as it's finished. --Wolf --

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Chuck Williams
nt: Friday, January 14, 2005 9:33 AM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > Chuck Williams wrote: > > Doug Cutting wrote: > > > It would indeed be nice to be

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread mark harwood
>>For example, the RPC made by its > rewrite() > implementation could also return the docFreq() of > each term in the > rewritten query I haven't been following the remoting conversation in detail but this may be relevant: Using the associated docFreq of each expanded term is not particularly be

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Chuck Williams wrote: Doug Cutting wrote: > It would indeed be nice to be able to short-circuit rewriting for > queries where it is a no-op. Do you have a proposal for how this could > be done? First, this gets into the other part of Bug 31841. I don't believe MultiSearcher.rewrite() is eve

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Wolf Siberski wrote: Doug Cutting wrote: So, when a query is executed on a MultiSearcher of RemoteSearchables, the following remote calls are made: 1. RemoteSearchable.rewrite(Query) is called After that step, are wildcards replaced by term lists? Yes. I haven't taken a look at the rewrite() met

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Wolf Siberski
Doug Cutting wrote: Chuck Williams wrote: I think the question is how frequent and how expensive would those two steps be in comparison to the difference in the query processing. I think the first question is: can we get RemoteSearchables to work correctly and reasonably efficiently for simple que

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Wolf Siberski
Doug Cutting wrote: So, when a query is executed on a MultiSearcher of RemoteSearchables, the following remote calls are made: 1. RemoteSearchable.rewrite(Query) is called After that step, are wildcards replaced by term lists? I haven't taken a look at the rewrite() methods. Could you explain to

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Wolf Siberski
Doug Cutting wrote: Wolf Siberski wrote: In the new context, the searcher would be a MultiSearcher, and to resolve that call at one of the RemoteSearchables, the method getSimilarity() would have to be called remotely on it. I think this can be handled by: a. declaring TermQuery.searcher transient -

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Chuck Williams
ine() for all query types (which is greatly simplified by a good default implementation). Chuck > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 13, 2005 11:41 AM > To: Lucene Developers List > Subject: Re: How to p

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Chuck Williams wrote: If auto-filters can provide an effective implementation for RangeQuery's that avoids rewriting, and we can give up MultiTermQuery and PrefixQuery in the distributed environment, then how about something like this refinement: 1. No rewriting is done. It would indeed be nice

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Chuck Williams
From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 13, 2005 10:29 AM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > Chuck Williams wrote: > > It just s

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Paul Elschot
On Thursday 13 January 2005 19:29, Doug Cutting wrote: > Chuck Williams wrote: > > It just seems like a lot of IPC activity for each query. As things > > stand now, I think you are proposing this? > > 1. MultiSearcher calls the remote node to rewrite the query, > > requiring serialization of th

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Chuck Williams wrote: It just seems like a lot of IPC activity for each query. As things stand now, I think you are proposing this? 1. MultiSearcher calls the remote node to rewrite the query, requiring serialization of the query. 2. The remote node returns the rewritten query to the dispatc

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Chuck Williams
y processing. Chuck > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 13, 2005 9:14 AM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > >

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Chuck Williams wrote: I think there is another problem here. It is currently the Weight implementations that do rewrite(), which requires access to the index, not just to the idf's. E.g., RangeQuery.rewrite() must find the terms in the index within the range. So, the Weight cannot be computed in
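Editorial illustration of the point above that a range query can only be expanded where the index lives: the sketch below models each node's term dictionary as a sorted set and expands the same range against two nodes. The class and method names (RangeExpansionSketch, expandRange) and the sample terms are hypothetical and are not Lucene's RangeQuery API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: not Lucene code. A TreeSet stands in for one node's term dictionary.
final class RangeExpansionSketch {

    // Returns the indexed terms that fall inside [lower, upper], both bounds inclusive.
    // Only the node that holds the term dictionary can answer this.
    static List<String> expandRange(TreeSet<String> indexTerms, String lower, String upper) {
        return new ArrayList<>(indexTerms.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        TreeSet<String> nodeA = new TreeSet<>(Arrays.asList("20050110", "20050112", "20050120"));
        TreeSet<String> nodeB = new TreeSet<>(Arrays.asList("20050111", "20050113"));

        // The same range expands to different term lists on different nodes.
        System.out.println(expandRange(nodeA, "20050111", "20050113"));
        System.out.println(expandRange(nodeB, "20050111", "20050113"));
    }
}
```

Because the expansion differs per node, any weight that depends on the expanded terms cannot be fixed before each node has seen the query, which appears to be the difficulty raised in this message.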

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Wolf Siberski wrote: Yes, I agree. I just wanted to point out that the current Weight implementations need to be modified heavily to introduce the behaviour you describe above. For example, take a look at TermQuery.TermWeight.scorer(): [...] return new TermScorer(this, termDocs, getSimilarity

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Chuck Williams
--Original Message- > From: Paul Elschot [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 13, 2005 12:18 AM > To: lucene-dev@jakarta.apache.org > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > On Thursday 13 J

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Paul Elschot
s. > Chuck > > > -Original Message- > > From: Wolf Siberski [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, January 12, 2005 4:08 PM > > To: Lucene Developers List > > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems > with &

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Chuck Williams
uted until after the Query is rewritten, which requires access to the index on the remote node. Chuck > -Original Message- > From: Wolf Siberski [mailto:[EMAIL PROTECTED] > Sent: Wednesday, January 12, 2005 4:08 PM > To: Lucene Developers List > Subject: Re: How

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Wolf Siberski
Doug Cutting wrote: Wolf Siberski wrote: Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. This is similar to what I tried to do with topmostSearcher, but a m

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Chuck Williams
Doug Cutting wrote: > Searchers are based on > IndexReaders, and hence docFreqs don't change until a new Searcher is > created. So long as this is true, and the central dispatch node uses a > searcher, then a simple cache, perhaps that is pre-fetched, is all > that's feasible. It shouldn

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Wolf Siberski wrote: Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. This is similar to what I tried to do with topmostSearcher, but a much better way to do

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote: There needs to be a way to create the aggregate docFreq table and keep it current under incremental changes to the indices on the various remote nodes. I think you're getting ahead of yourself. Searchers are based on IndexReaders, and hence docFreqs don't change until a new S
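Editorial illustration of the caching idea in this message: if docFreqs are fixed for the lifetime of a searcher, the dispatch node can remember each value after the first lookup. This is a minimal sketch under that assumption; DocFreqCache and its remoteLookup function are hypothetical names, and the lookup function merely stands in for a remote call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch: not Lucene code. One cache instance is assumed to live
// exactly as long as one searcher, so cached values never go stale.
final class DocFreqCache {
    private final ToIntFunction<String> remoteLookup; // stands in for a remote docFreq call
    private final Map<String, Integer> cache = new HashMap<>();
    private int remoteCalls = 0;

    DocFreqCache(ToIntFunction<String> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Returns the cached docFreq, fetching it once on the first request for a term.
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        int df = remoteLookup.applyAsInt(term);
        cache.put(term, df);
        return df;
    }

    // Number of lookups that actually reached the remote side.
    int remoteCalls() {
        return remoteCalls;
    }
}
```

When the underlying index changes, the sketch assumes a new searcher and therefore a new, empty cache, rather than any attempt at invalidation.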

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Wolf Siberski
Chuck Williams wrote: I've read through Wolf's patch and see a few issues (please correct anything wrong here): 1. DfMapSimilarity works only with a limited set of queries.[...] 2. The patch hardwires the use of DfMapSimilarity into MultiSearcher.[...] 3. Philosophically, I'm not convinced

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Chuck Williams
rom the aggregate table to address this issue and assuming a docFreq of 1). Is there a better way, or perhaps I'm missing something? Chuck > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Wednesday, January 12, 2005 8:58 AM > To: Lucene

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote: I was thinking of the aggressive version with an index-time solution, although I don't know the Lucene architecture for distributed indexing and searching well enough to formulate the idea precisely. Conceptually, I'd like each server that owns a slice of the index in a distri

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Chuck Williams
end the queries out to the remote Searcher's and these Searcher's could consult their local indexes for the correct docFreq's to use. Chuck > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 11, 2005 3:46 PM

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. Glad to hear it at least makes sense... Now I hope it works! I'm still left wondering if having MultiSearcher

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Chuck Williams
> From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 11, 2005 1:13 PM > To: Lucene Developers List > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with > Similarity.docFreq() ? > > Chuck Williams wrote: > > As Wolf does, I

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote: As Wolf does, I hope a committer with deep knowledge of Lucene's design in this area will weigh in on the issue and help to resolve it. The root of the bug is in MultiSearcher.search(). This should construct a Weight, weight the query, then score the now-weighted query. Her
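Editorial illustration of why the weight should be built against the combined searcher: an idf computed from one slice's docFreq differs from the idf computed from the aggregate, so the same document would score differently depending on which slice holds it. The sketch is self-contained and uses hypothetical names (SliceStats, MapSlice, AggregateStats, DocFreqSketch); none of them are Lucene classes. The idf formula, log(numDocs / (docFreq + 1)) + 1, is assumed here as the default.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the statistics one index slice can report.
interface SliceStats {
    int docFreq(String term);
    int maxDoc();
}

// One slice backed by an in-memory map of term -> document frequency.
final class MapSlice implements SliceStats {
    private final Map<String, Integer> freqs;
    private final int maxDoc;

    MapSlice(Map<String, Integer> freqs, int maxDoc) {
        this.freqs = freqs;
        this.maxDoc = maxDoc;
    }

    public int docFreq(String term) {
        return freqs.getOrDefault(term, 0);
    }

    public int maxDoc() {
        return maxDoc;
    }
}

// The combined view: sums docFreq and maxDoc over all slices.
final class AggregateStats implements SliceStats {
    private final List<SliceStats> slices;

    AggregateStats(List<SliceStats> slices) {
        this.slices = slices;
    }

    public int docFreq(String term) {
        int sum = 0;
        for (SliceStats s : slices) {
            sum += s.docFreq(term);
        }
        return sum;
    }

    public int maxDoc() {
        int sum = 0;
        for (SliceStats s : slices) {
            sum += s.maxDoc();
        }
        return sum;
    }
}

final class DocFreqSketch {
    // Assumed idf formula: log(numDocs / (docFreq + 1)) + 1.
    static double idf(SliceStats stats, String term) {
        return Math.log(stats.maxDoc() / (double) (stats.docFreq(term) + 1)) + 1.0;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();
        a.put("lucene", 3);
        Map<String, Integer> b = new HashMap<>();
        b.put("lucene", 1);
        SliceStats s1 = new MapSlice(a, 10);
        SliceStats s2 = new MapSlice(b, 30);
        SliceStats all = new AggregateStats(Arrays.asList(s1, s2));

        // Per-slice idf values disagree; the aggregate gives one consistent value.
        System.out.println("slice 1 idf: " + idf(s1, "lucene"));
        System.out.println("slice 2 idf: " + idf(s2, "lucene"));
        System.out.println("combined idf: " + idf(all, "lucene"));
    }
}
```

In this picture the combined object plays the role the thread assigns to MultiSearcher: the weight is computed once from the aggregate statistics and then handed to each slice for scoring.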

RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Chuck Williams
ing efficient. Is something along those lines possible? Chuck > -Original Message- > From: Wolf Siberski [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 11, 2005 12:55 AM > To: Lucene Developers List > Subject: How to proceed with Bug 31841 - MultiSearcher problems w

How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Wolf Siberski
As I'm very interested in resolving this bug, I would like to resume the discussion about it. Chuck Williams (the original bug reporter) and I have both already provided patches. Are any of the committers willing to review them? If changes are necessary, or another way of handling this issue turns