Some people don't do IO while searching at all. Once you're over a certain qps/index-size threshold, you need fewer nodes to keep your whole index (or its hot parts) in memory than to keep the combined IO subsystem throughput high enough to satisfy disk-based search demands.
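On the delegation overhead itself (the subject of the thread quoted below), here is a minimal sketch of what the proposed split would buy a decorator. Assumptions: DocIterator, SplitScorer, ArrayScorer and DecayScorer are hypothetical stand-ins written for this example, not Lucene classes, and the halving function stands in for a real age-decay curve. The decorator hands out the inner iterator as-is and wraps only score(int), so nextDoc and advance cost no extra call. The binary search in ArrayScorer.score shows the other side of the trade: scoring by doc id has to locate state that iteration already had.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Illustration only: every type here is a hypothetical stand-in, not a Lucene class. */
public class SplitScorerSketch {

    /** Minimal doc-id iterator (assumed shape, loosely modelled on a DISI). */
    interface DocIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
        int advance(int target);
    }

    /** The shape proposed in the thread: iteration and scoring are separate. */
    interface SplitScorer {
        DocIterator iterator();
        float score(int docId);
    }

    /** Toy scorer over sorted matching doc ids with precomputed raw scores. */
    static final class ArrayScorer implements SplitScorer {
        private final int[] docs;
        private final float[] raw;

        ArrayScorer(int[] sortedDocs, float[] rawScores) {
            this.docs = sortedDocs;
            this.raw = rawScores;
        }

        @Override public DocIterator iterator() {
            return new DocIterator() {
                private int idx = -1;

                @Override public int nextDoc() {
                    idx++;
                    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
                }

                @Override public int advance(int target) {
                    int doc;
                    do { doc = nextDoc(); } while (doc < target);
                    return doc;
                }
            };
        }

        // Scoring an arbitrary doc id means finding it again; this lookup is
        // the redone work that the objection in the thread points at.
        @Override public float score(int docId) {
            int i = Arrays.binarySearch(docs, docId);
            if (i < 0) throw new IllegalArgumentException("doc " + docId + " did not match");
            return raw[i];
        }
    }

    /** Decorator that changes only score(); the inner iterator is returned untouched. */
    static final class DecayScorer implements SplitScorer {
        private final SplitScorer inner;
        private final DoubleUnaryOperator decay;

        DecayScorer(SplitScorer inner, DoubleUnaryOperator decay) {
            this.inner = inner;
            this.decay = decay;
        }

        @Override public DocIterator iterator() { return inner.iterator(); }

        @Override public float score(int docId) {
            return (float) decay.applyAsDouble(inner.score(docId));
        }
    }

    /** Iterates all hits and sums their scores. */
    static float sumScores(SplitScorer s) {
        DocIterator it = s.iterator();
        float sum = 0f;
        for (int d = it.nextDoc(); d != DocIterator.NO_MORE_DOCS; d = it.nextDoc()) {
            sum += s.score(d);
        }
        return sum;
    }

    public static void main(String[] args) {
        SplitScorer base = new ArrayScorer(new int[] {2, 5, 9}, new float[] {1.0f, 2.0f, 4.0f});
        SplitScorer decayed = new DecayScorer(base, x -> x * 0.5); // placeholder decay
        System.out.println(sumScores(base));
        System.out.println(sumScores(decayed));
    }
}
```

Running main prints the summed raw and decayed scores for the three toy hits.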
2010/6/9 Doron Cohen <cdor...@gmail.com>:
> I too tend to ignore the overhead of delegated calls, especially compared
> to all other IO ops and computations done by the stack of scorers, but
> accepting that you cannot ignore it, could you achieve the same goal by
> sub-classing the top query where you subclass its weight to return a
> sub-class of its scorer which would only override score() but not the other
> methods, and in score would apply that e.g. decay logic? This way no
> delegation is required for the other methods. A disadvantage of this is that
> you would need to subclass like this any kind of top-level query that might
> come up in your app - so not sure if this is really acceptable in your case.
> Another disadvantage is that this is much more complicated code to write.
>
> Doron
>
> 2010/6/8 John Wang <john.w...@gmail.com>
>>
>> Wouldn't you get it as well with the proposed api?
>> You would still be able to iterate the doc and at that point call score
>> with the docid. If you call score() along with iteration, you would still
>> get the information, no?
>> Making scorer take a docid allows you to score any docid in the reader if the
>> query wants it to. Wouldn't it make it more flexible?
>> -John
>>
>> On Tue, Jun 8, 2010 at 10:54 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
>>>
>>> To compute a score you have to see which of your subqueries did not
>>> match, which did, and what are the docfreqs/positions for them.
>>> When iterating, and calling score() only for current doc - parts of
>>> this data (maybe even all of it, not sure) is already gathered for
>>> you. If you allow calling score(int doc) - for arbitrary docId, you'll
>>> have to redo this work.
>>>
>>> 2010/6/8 John Wang <john.w...@gmail.com>:
>>> > Hi Earwin:
>>> >
>>> > I am not sure I understand here, e.g.
>>> > what is the difference between:
>>> >
>>> > float myscorinCode(){
>>> >     return computeMyScore(scorer.score());
>>> > }
>>> >
>>> > and
>>> >
>>> > float myscorinCode(){
>>> >     return computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID()));
>>> > }
>>> >
>>> > In the case of BQ, when you get a hit, would you still be able to call
>>> > subscorer.score(hit)? Why is the point of iteration important for BQ?
>>> >
>>> > please elaborate.
>>> >
>>> > Thanks
>>> >
>>> > -John
>>> >
>>> > On Tue, Jun 8, 2010 at 10:10 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
>>> >>
>>> >> The problem with your proposal is that, currently, Lucene uses current
>>> >> iteration state to compute score.
>>> >> I.e. it already knows which of SHOULD BQ clauses matched for current
>>> >> doc, so it's easier to calculate the score.
>>> >> If you change API to allow scoring arbitrary documents (even those
>>> >> that didn't match the query at all), you're opening a can of worms :)
>>> >>
>>> >> As an alternative, you can try looking at MG4J sources. As far as I
>>> >> understand, their scoring is decoupled from matching, just like you
>>> >> (and I bet many more people) want. The matcher is separate, and the
>>> >> scoring entity accepts current matcher state instead of document id,
>>> >> so you get the best of both worlds.
>>> >>
>>> >> On Tue, Jun 8, 2010 at 21:01, John Wang <john.w...@gmail.com> wrote:
>>> >> > re: But Scorer is itself an iterator, so what prevents you from calling
>>> >> > nextDoc and advance on it without score()
>>> >> >
>>> >> > Nothing. It is just inefficient to pay the method call overhead just to
>>> >> > override score.
>>> >> >
>>> >> > re: If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
>>> >> > is not good enough I'd just develop my own.
>>> >> >
>>> >> > That is what I am doing. I am just proposing the change (see my first email)
>>> >> > as an improvement.
>>> >> >
>>> >> > re: Scorer is itself an iterator
>>> >> >
>>> >> > yes, that is the current definition. The point of the proposal is to make
>>> >> > this change.
>>> >> >
>>> >> > -John
>>> >> >
>>> >> > On Tue, Jun 8, 2010 at 9:45 AM, Shai Erera <ser...@gmail.com> wrote:
>>> >> >>
>>> >> >> Well … I don't know the reason either and always thought Scorer and
>>> >> >> Similarity are confusing.
>>> >> >>
>>> >> >> But Scorer is itself an iterator, so what prevents you from calling
>>> >> >> nextDoc and advance on it without score(). And what would the returned
>>> >> >> DISI do when nextDoc is called, if not delegate to its subs?
>>> >> >>
>>> >> >> If I were in your shoes, I'd simply provide a Query wrapper. If CSQ
>>> >> >> is not good enough I'd just develop my own.
>>> >> >>
>>> >> >> But perhaps others think differently?
>>> >> >>
>>> >> >> Shai
>>> >> >>
>>> >> >> On Tuesday, June 8, 2010, John Wang <john.w...@gmail.com> wrote:
>>> >> >> > Hi Shai:
>>> >> >> > I am not sure I understand how changing Similarity would solve this
>>> >> >> > problem, wouldn't you need the reader?
>>> >> >> > As for PayloadTermQuery, payload is not always the most efficient
>>> >> >> > way of storing such data, especially when number of terms << numdocs.
>>> >> >> > (I am not sure accessing the payload when you iterate is a good idea,
>>> >> >> > but that is another discussion)
>>> >> >> >
>>> >> >> > Yes, what I described is exactly a simple CustomScoreQuery for a
>>> >> >> > special use-case. The problem is also in CustomScoreQuery, where
>>> >> >> > nextDoc and advance are calling the sub-scorers as a wrapper. This can
>>> >> >> > be avoided if the Scorer returns an iterator instead.
>>> >> >> >
>>> >> >> > Separating scoring and doc iteration is a good idea anyway.
>>> >> >> > I don't know the reason to combine them originally.
>>> >> >> > Thanks
>>> >> >> > -John
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 8:47 AM, Shai Erera <ser...@gmail.com> wrote:
>>> >> >> >
>>> >> >> > So wouldn't it make sense to add some method to Similarity? Which
>>> >> >> > receives the doc Id in question maybe ... just thinking here.
>>> >> >> >
>>> >> >> > Factoring Scorer like you propose would create 3 objects for
>>> >> >> > scoring/iterating: Scorer (which really becomes an iterator),
>>> >> >> > Similarity and CustomScoreFunction ...
>>> >> >> >
>>> >> >> > Maybe you can use CustomScoreQuery? or PayloadTermQuery? depends how you
>>> >> >> > compute your age decay function (where you pull the data about the age of
>>> >> >> > the document).
>>> >> >> >
>>> >> >> > Shai
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 6:41 PM, John Wang <john.w...@gmail.com> wrote:
>>> >> >> > Hi Shai:
>>> >> >> > Similarity in many cases is not sufficient for scoring. For example,
>>> >> >> > to implement age decaying of a document (very useful for corpuses like news
>>> >> >> > or tweets), you want to project the raw tfidf score onto a time curve, say
>>> >> >> > f(x). To do this, you'd have a custom scorer that decorates the underlying
>>> >> >> > scorer from your, say, boolean query:
>>> >> >> >
>>> >> >> > public float score(){ return myFunc(innerScorer.score());}
>>> >> >> >
>>> >> >> > This is fine, but then you would have to do this as well:
>>> >> >> >
>>> >> >> > public int nextDoc(){ return innerScorer.nextDoc();}
>>> >> >> >
>>> >> >> > and also:
>>> >> >> >
>>> >> >> > public int advance(int target){ return innerScorer.advance(target);}
>>> >> >> >
>>> >> >> > The difference here is that nextDoc and advance are called far more times than score.
>>> >> >> > And you are introducing an extra method call for them, which is not
>>> >> >> > insignificant for queries that result in large recall sets.
>>> >> >> >
>>> >> >> > Hope this makes sense.
>>> >> >> > Thanks
>>> >> >> > -John
>>> >> >> > On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <ser...@gmail.com> wrote:
>>> >> >> > I'm not sure I understand what you mean - Scorer is a DISI itself, and
>>> >> >> > the scoring formula is mostly controlled by Similarity.
>>> >> >> >
>>> >> >> > What will be the benefits of the proposed change?
>>> >> >> >
>>> >> >> > Shai
>>> >> >> >
>>> >> >> > On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.w...@gmail.com> wrote:
>>> >> >> >
>>> >> >> > Hi guys:
>>> >> >> >
>>> >> >> > I'd like to make a proposal to change the Scorer class/api to the
>>> >> >> > following:
>>> >> >> >
>>> >> >> > public abstract class Scorer{
>>> >> >> >     public abstract DocIdSetIterator getDocIDSetIterator();
>>> >> >> >     public abstract float score(int docid);
>>> >> >> > }
>>> >> >> >
>>> >> >> > Reasons:
>>> >> >> >
>>> >> >> > 1) To build a Scorer from an existing Scorer (e.g. one that produces raw
>>> >> >> > scores from tfidf), one would decorate it, and it would introduce overhead
>>> >> >> > (in function calls) around nextDoc and advance, even if you just want to
>>> >> >> > augment the score method, which is called far fewer times.
>>> >> >> >
>>> >> >> > 2) The current contract forces scoring on the currentDoc in the
>>> >> >> > underlying iterator. So once you pass "current", you can no longer
>>> >> >> > score. In one of our use-cases, it is very inconvenient.
>>> >> >> >
>>> >> >> > What do you think? I can go ahead and open an issue and work on a patch
>>> >> >> > if I get some agreement.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >> > -John
>>> >> >> >
>>> >> >>
>>> >> >> ---------------------------------------------------------------------
>>> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> >> >>
>>> >>
>>> >> --
>>> >> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>> >> Phone: +7 (495) 683-567-4
>>> >> ICQ: 104465785
>>> >>
>>>
>

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785