Re: Proposal: Scorer api change

2010-06-09 Thread Paul Elschot
On Wednesday 09 June 2010 14:40:49, Shai Erera wrote: > So just to make sure I understand: > > A Matcher is paired w/ a Scorer, and this pairing is done at Query > construction time ... e.g. if I use QP to construct the Query, I'd need to > extend QP by providing my custom scorer for relevant Matc

Re: Proposal: Scorer api change

2010-06-09 Thread John Wang
Doron: I hadn't considered the general implications of scoring arbitrary docs until Earwin brought it up. I agree about the can of worms it would open up... Thoughts about: public float score(DocIdSetIterator iter); ? Just thinking out loud. +++ on Query = Matcher + Scorin

Re: Proposal: Scorer api change

2010-06-09 Thread Shai Erera
So just to make sure I understand: A Matcher is paired w/ a Scorer, and this pairing is done at Query construction time ... e.g. if I use QP to construct the Query, I'd need to extend QP by providing my custom scorer for relevant Matchers (and reuse the scorers logic for the other fragments), and

Re: Proposal: Scorer api change

2010-06-09 Thread Earwin Burrfoot
On Wed, Jun 9, 2010 at 15:39, Doron Cohen wrote: > I think you'd still not modify a nicely extendible/wrapable API just to > avoid the extra call, unless benchmarking shows that the cost is high. Current Query API is NOT nicely extensible :) Look above for BM25BooleanQuery mention. -- Kirill Za

Re: Proposal: Scorer api change

2010-06-09 Thread Earwin Burrfoot
What I have in mind is basically having two parallel trees - one for matching, one for scoring. Matching tree is completely independent and can be used as a filter with sort-by-field approach, for example. Scoring tree nodes have references to corresponding matching tree nodes, so they can exploit

Re: Proposal: Scorer api change

2010-06-09 Thread Shai Erera
I don't feel comfortable with the statement "these visitors are then free to specialize on matchers or not ...". Let's think how this API will be used .. today, the user has two hooks - the QueryParser and Collector. Collector allows you to plug in your own and by extending QP you can return your o

Re: Proposal: Scorer api change

2010-06-09 Thread Doron Cohen
> > If you can fix a hotspot in Lucene to avoid an extra method call, an > extra add/multiply, etc., you should. Doing so ensures the cost can't > be there. Not doing so means you rely on the JRE to be smart enough, > and it very easily may not be (there are so many variables), and that > also ma

Re: Proposal: Scorer api change

2010-06-09 Thread Earwin Burrfoot
> Can we represent the Query > state in some general structure, that no matter which Query you get, you'll > know how to score it? No. You could go for a unified interface that allows you to express different query states, like a set of untyped key-values, but you'll end up switching on these keyval

Re: Proposal: Scorer api change

2010-06-09 Thread Shai Erera
Ok point taken - don't trust the JVM! I don't trust it either. So for a TermQuery, which needs to evaluate 1M docs, you add 1M nextDoc calls w/ the delegate approach. But for a BQ, that's not the case. You add one method call which can be followed by a series of nextDoc/advance calls by the su

Re: Proposal: Scorer api change

2010-06-09 Thread Michael McCandless
I generally don't trust the compiler, if/when I have that freedom. If you can fix a hotspot in Lucene to avoid an extra method call, an extra add/multiply, etc., you should. Doing so ensures the cost can't be there. Not doing so means you rely on the JRE to be smart enough, and it very easily ma

Re: Proposal: Scorer api change

2010-06-09 Thread Earwin Burrfoot
Lies, lies, lies :) I mean, the Sun JIT is over-relied on, especially with regard to inlining. But there are some cases when you can trust it. I.e. if you call a virtual method and this exact call-site gets refs to different objects at runtime (meaning here - you wrap different Queries in your WrapperQ

Re: Proposal: Scorer api change

2010-06-09 Thread Doron Cohen
Hi John, I think there are two aspects to the modified API suggestion: (1) allow custom scoring with less delegation-call overhead. (2) support arbitrary doc scoring with each scorer (nicely put by Earwin) allowing scoring docs also not in docid order. Point (2) is a nice feature. How largely is

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
re: "The compiler does make optimizations and inlines such code/calls if it can" Are you really sure of this in THIS case? Can you elaborate WHEN such inline optimizations happen and how it applies here? This sounds to me like a very vague and irresponsible statement. Much of the Java literature does not

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
I agree w/ both Doron and Earwin, on different points though. I don't think the method call is an overhead John. You don't need to reiterate it. The compiler does make optimizations and inlines such code/calls if it can. More than that, the query processing involves so many method calls, that I do

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
Some people don't do IO while searching at all. When you're over a certain qps/index size threshold, you need fewer nodes to keep all your index (or its hot parts) in memory than to keep combined IO subsystem throughput high enough to satisfy disk-based search demands. 2010/6/9 Doron Cohen : > I too

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
With your proposed API you HAVE to support arbitrary doc scoring with each scorer. This can easily lead to heaps of complex, yet rarely-used code, as most people will still use score-only-current-doc approach, and this will invariably produce optimized shortcuts. MG4J approach, on the other hand,

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Hi Doron: Re: "comparing to all other IO ops and computations done by the stack of scorers" Lucene caches rather well and compresses well enough that the IO cache is effective enough that you are not really paying for disk movement most of the time. As for the stack of scorers, that is actually m

Re: Proposal: Scorer api change

2010-06-08 Thread Doron Cohen
I too tend to ignore the overhead of delegated calls, especially compared to all other IO ops and computations done by the stack of scorers, but accepting that you cannot ignore it, could you achieve the same goal by sub-classing the top query where you subclass its weight to return a sub-class of

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Wouldn't you get it as well with the proposed API? You would still be able to iterate the docs and at that point call score with the docid. If you call score() along with iteration, you would still get the information, no? Making scorer take a docid allows you to score any docid in the reader if the query w

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
To compute a score you have to see which of your subqueries did not match, which did, and what the docfreqs/positions are for them. When iterating, and calling score() only for the current doc, parts of this data (maybe even all of it, not sure) are already gathered for you. If you allow calling score(

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Shai: Java cannot inline in this case. Actually there is an urban legend around using final to hint to the underlying compiler to inline :) (turns out to be false, one reason being dynamic classloading). Write a simple program and try and see for yourself (remember to turn on -server on VM optio

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
What do you mean "we are not inlining"? The compiler inlines methods .. at least it tries. Shai On Tue, Jun 8, 2010 at 8:21 PM, John Wang wrote: > Shai: > > method call overhead in this case is not insignificant because it is in > a very tight loop, and no, compiler cannot optimize it for y

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Hi Earwin: I am not sure I understand here, e.g. what is the difference between: float myscorinCode(){ return computeMyScore(scorer.score()); } and float myscorinCode(){ return computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID())); } In the

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Shai: method call overhead in this case is not insignificant because it is in a very tight loop, and no, compiler cannot optimize it for you, we are not inline-ing cuz we are in a java world. You are right, this breaks backward compatibility. But from 2.4 - 2.9, we have done MUCH worse.

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
Yeah I got what he meant, but I honestly don't think those delegate calls are an overhead ... Shai On Tue, Jun 8, 2010 at 8:12 PM, Earwin Burrfoot wrote: > Shai, his wrapper Scorer will just look like: > DISI getDISI() { > return delegate.getDISI(); > } > > float score(int doc) { > return cal

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
Shai, his wrapper Scorer will just look like: DISI getDISI() { return delegate.getDISI(); } float score(int doc) { return calcMyAwesomeScore(doc); } this saves delegate.nextDoc(), delegate.advance() indirection calls. But I already offered a better alternative :) On Tue, Jun 8, 2010 at 21:09
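
The wrapper described in this message could look roughly like the sketch below. It is an illustration under stated assumptions, not Lucene code: `DISI` and `Scorer` are minimal stand-ins named after the abbreviations in the message, and `BoostingScorer`, `WrapperDemo`, and the scoring formula are hypothetical.

```java
// Stand-in for DocIdSetIterator ("DISI" in the message); NOT Lucene's class.
interface DISI {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc();
}

// The proposed Scorer shape, using the message's abbreviated method names.
interface Scorer {
    DISI getDISI();
    float score(int doc);
}

// Hypothetical wrapper: hands out the delegate's iterator as-is and
// replaces only the scoring, so nextDoc()/advance() are never forwarded.
class BoostingScorer implements Scorer {
    private final Scorer delegate;
    private final float boost;

    BoostingScorer(Scorer delegate, float boost) {
        this.delegate = delegate;
        this.boost = boost;
    }

    @Override public DISI getDISI() { return delegate.getDISI(); }

    // Stands in for calcMyAwesomeScore(doc) from the message.
    @Override public float score(int doc) { return boost * delegate.score(doc); }
}

class WrapperDemo {
    // Hypothetical base scorer: matches the given docs, scores each as doc + 1.
    static Scorer baseScorer(final int[] docs) {
        return new Scorer() {
            @Override public DISI getDISI() {
                return new DISI() {
                    private int pos = -1;
                    @Override public int nextDoc() {
                        if (pos < docs.length) pos++;
                        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
                    }
                };
            }
            @Override public float score(int doc) { return doc + 1; }
        };
    }

    public static void main(String[] args) {
        Scorer wrapped = new BoostingScorer(baseScorer(new int[] {0, 2}), 2.0f);
        DISI it = wrapped.getDISI();
        for (int doc = it.nextDoc(); doc != DISI.NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println(doc + " -> " + wrapped.score(doc));
        }
    }
}
```

The caller drives the delegate's iterator directly, so the wrapper adds one call per scored document rather than one per iteration step.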

Re: Proposal: Scorer api change

2010-06-08 Thread Earwin Burrfoot
The problem with your proposal is that, currently, Lucene uses current iteration state to compute score. I.e. it already knows which of SHOULD BQ clauses matched for current doc, so it's easier to calculate the score. If you change API to allow scoring arbitrary documents (even those that didn't ma

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
I guess I must be missing something fundamental here :). If Scorer is defined as you propose, and I create my Scorer which impls getDISI() as "return this" - what do I lose? What's wrong w/ Scorer already being a DISI? You mention "it is just inefficient to pay the method call overhead ..." - wha

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
re: But Scorer is itself an iterator, so what prevents you from calling nextDoc and advance on it without score() Nothing. It is just inefficient to pay the method call overhead just to overload score. re: If I were in your shoes, I'd simply provide a Query wrapper. If CSQ is not good enough I'd

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
Well … I don't know the reason either and always thought Scorer and Similarity are confusing. But Scorer is itself an iterator, so what prevents you from calling nextDoc and advance on it without score(). And what would the returned DISI do when nextDoc is called, if not delegate to its subs? If

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Hi Shai: I am not sure I understand how changing Similarity would solve this problem, wouldn't you need the reader? As for PayloadTermQuery, payload is not always the most efficient way of storing such data, especially when number of terms << numdocs. (I am not sure accessing the payload

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
So wouldn't it make sense to add some method to Similarity? Which receives the doc Id in question maybe ... just thinking here. Factoring Scorer like you propose would create 3 objects for scoring/iterating: Scorer (which really becomes an iterator), Similarity and CustomScoreFunction ... Maybe y

Re: Proposal: Scorer api change

2010-06-08 Thread John Wang
Hi Shai: Similarity in many cases is not sufficient for scoring. For example, to implement age decaying of a document (very useful for corpuses like news or tweets), you want to project the raw tfidf score onto a time curve, say f(x). To do this, you'd have a custom scorer that decorates the u
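
As a rough illustration of projecting a raw score onto a time curve f(x): the message does not say which curve is meant, so the exponential half-life decay below is an assumption, and `AgeDecay`, `decay`, `decayedScore`, and the parameter names are all hypothetical.

```java
class AgeDecay {
    // Assumed curve f(x): weight halves every halfLifeDays (not from the thread).
    static double decay(double ageDays, double halfLifeDays) {
        return Math.pow(0.5, ageDays / halfLifeDays);
    }

    // Scales a raw (e.g. tf-idf) score by the document's age.
    static float decayedScore(float rawScore, double ageDays, double halfLifeDays) {
        return (float) (rawScore * decay(ageDays, halfLifeDays));
    }

    public static void main(String[] args) {
        // A document one half-life old keeps half of its raw score.
        System.out.println(decayedScore(4.0f, 7.0, 7.0));
    }
}
```

A decorating scorer would call something like `decayedScore` on the value returned by the scorer it wraps, looking up the document's age by doc id.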

Re: Proposal: Scorer api change

2010-06-08 Thread Shai Erera
I'm not sure I understand what you mean - Scorer is a DISI itself, and the scoring formula is mostly controlled by Similarity. What will be the benefits of the proposed change? Shai On Tue, Jun 8, 2010 at 8:25 AM, John Wang wrote: > Hi guys: > > I'd like to make a proposal to change the Sc

Proposal: Scorer api change

2010-06-07 Thread John Wang
Hi guys: I'd like to make a proposal to change the Scorer class/api to the following: public abstract class Scorer{ DocIdSetIterator getDocIDSetIterator(); float score(int docid); } Reasons: 1) To build a Scorer from an existing Scorer (e.g. that produces raw scores from tfidf), one
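
A minimal sketch of how the proposed shape might be used, for illustration only. `DocIdSetIterator` here is a stripped-down stand-in rather than Lucene's class, and `ArrayIterator`, `ConstantScorer`, and `ProposedApiDemo` are hypothetical names.

```java
// Simplified stand-in for Lucene's DocIdSetIterator (an assumption of this sketch).
abstract class DocIdSetIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int docID();
    abstract int nextDoc();
}

// The API as proposed in the message: iteration and scoring are separated.
abstract class Scorer {
    abstract DocIdSetIterator getDocIDSetIterator();
    abstract float score(int docid);
}

// Hypothetical iterator over a sorted array of matching doc ids.
class ArrayIterator extends DocIdSetIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayIterator(int[] docs) { this.docs = docs; }

    @Override int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    @Override int nextDoc() {
        if (pos < docs.length) pos++;
        return docID();
    }
}

// Hypothetical scorer: every matching doc gets the same score.
class ConstantScorer extends Scorer {
    private final int[] docs;
    private final float value;

    ConstantScorer(int[] docs, float value) {
        this.docs = docs;
        this.value = value;
    }

    @Override DocIdSetIterator getDocIDSetIterator() { return new ArrayIterator(docs); }
    @Override float score(int docid) { return value; }
}

class ProposedApiDemo {
    // The caller drives the iterator and asks for a score by doc id.
    static float sumScores(Scorer scorer) {
        DocIdSetIterator it = scorer.getDocIDSetIterator();
        float total = 0f;
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            total += scorer.score(doc);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumScores(new ConstantScorer(new int[] {1, 4, 7}, 2.0f)));
    }
}
```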