On Wednesday, 09 June 2010 at 14:40:49, Shai Erera wrote:
> So just to make sure I understand:
>
> A Matcher is paired w/ a Scorer, and this pairing is done at Query
> construction time ... e.g. if I use QP to construct the Query, I'd need to
> extend QP by providing my custom scorer for relevant Matc
Doron:
I hadn't considered the general implications of scoring arbitrary docs
until Earwin brought it up. I agree about the can of worms it would open up...
Thoughts about:
public float score(DocIdSetIterator iter);
?
Just thinking out loud.
+++ on Query = Matcher + Scoring
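[A minimal sketch of the score(DocIdSetIterator) idea above, under stated assumptions: IteratorScorer and scoreAll are invented names; only DocIdSetIterator, its nextDoc()/docID() and NO_MORE_DOCS are real Lucene API.]

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public abstract class IteratorScorer {
  // Score whatever doc the supplied iterator is currently positioned on.
  public abstract float score(DocIdSetIterator iter) throws IOException;

  // The caller drives iteration itself and asks for a score at each position.
  public void scoreAll(DocIdSetIterator iter) throws IOException {
    while (iter.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      float s = score(iter);
      // ... hand (iter.docID(), s) to whatever collects the results
    }
  }
}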
So just to make sure I understand:
A Matcher is paired w/ a Scorer, and this pairing is done at Query
construction time ... e.g. if I use QP to construct the Query, I'd need to
extend QP by providing my custom scorer for relevant Matchers (and reuse the
scorers' logic for the other fragments), and
On Wed, Jun 9, 2010 at 15:39, Doron Cohen wrote:
> I think you'd still not modify a nicely extendible/wrapable API just to
> avoid the extra call, unless benchmarking shows that the cost is high.
Current Query API is NOT nicely extensible :)
Look above for BM25BooleanQuery mention.
--
Kirill Za
What I have in mind is basically having two parallel trees - one for
matching, one for scoring.
The matching tree is completely independent and can be used as a filter
with a sort-by-field approach, for example.
Scoring tree nodes have references to corresponding matching tree
nodes, so they can exploit
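[A rough, hedged sketch of the "two parallel trees" idea as described above - every name here is invented, nothing is existing Lucene API:]

import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;

// Matching tree: pure iteration, reusable as a filter (no scoring involved).
interface MatcherNode {
  DocIdSetIterator iterator();
  List<MatcherNode> children();
}

// Scoring tree: mirrors the matching tree, each node holding a reference to
// the matcher node whose iteration state it can exploit while computing scores.
interface ScorerNode {
  MatcherNode matcher();
  float score(int doc);
  List<ScorerNode> children();
}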
I don't feel comfortable with the statement "these visitors are then free to
specialize on matchers or not ...". Let's think about how this API will be used ...
today, the user has two hooks - the QueryParser and the Collector. Collector
allows you to plug in your own, and by extending QP you can return your o
>
> If you can fix a hotspot in Lucene to avoid an extra method call, an
> extra add/multiply, etc., you should. Doing so ensures the cost can't
> be there. Not doing so means you rely on the JRE to be smart enough,
> and it very easily may not be (there are so many variables), and that
> also ma
> Can we represent the Query
> state in some general structure, that no matter which Query you get, you'll
> know how to score it?
No. You could go for a unified interface that allows you to express
different query states, like a set of untyped key-values, but you'll
end up switching on these keyval
Ok, point taken - don't trust the JVM! I don't trust it either.
So for a TermQuery, which needs to evaluate 1M docs, you add 1M nextDoc
calls w/ the delegate approach. But for a BQ, that's not the case. You add
one method call which can be followed by a series of nextDoc/advance calls
by the su
I generally don't trust the compiler, if/when I have that freedom.
If you can fix a hotspot in Lucene to avoid an extra method call, an
extra add/multiply, etc., you should. Doing so ensures the cost can't
be there. Not doing so means you rely on the JRE to be smart enough,
and it very easily ma
Lies, lies, lies :)
I mean, the Sun JIT is over-relied on, especially in regard to inlining.
But, there are some cases when you can trust it. I.e. if you call a
virtual method and this exact call-site gets refs to different objects
at runtime (meaning here - you wrap different Queries in your
WrapperQ
Hi John,
I think there are two aspects to the modified API suggestion:
(1) allow custom scoring with less delegation-call overhead.
(2) support arbitrary doc scoring with each scorer (nicely put by Earwin),
allowing scoring of docs not in docid order as well.
Point (2) is a nice feature. How largely is
re: "The compiler does make optimizations and inlines such code/calls if it
can"
Are you really sure of this in THIS case? Can you elaborate on WHEN such inline
optimizations happen and how they apply here?
This sounds to me like a very vague and irresponsible statement.
Much Java literature does not
I agree w/ both Doron and Earwin, on different points though.
I don't think the method call is an overhead, John. You don't need to
reiterate it. The compiler does make optimizations and inlines such
code/calls if it can. More than that, the query processing involves so many
method calls that I do
Some people don't do IO while searching at all. When you're over
certain qps/index size threshold, you need fewer nodes to keep all your
index (or its hot parts) in memory than to keep the combined IO subsystem
throughput high enough to satisfy disc-based search demands.
2010/6/9 Doron Cohen :
> I too
With your proposed API you HAVE to support arbitrary doc scoring with
each scorer.
This can easily lead to heaps of complex, yet rarely-used code, as
most people will still use the score-only-current-doc approach, and this
will invariably produce optimized shortcuts.
MG4J approach, on the other hand,
Hi Doron:
Re: " comparing to all other IO ops and computations done by the stack of
scorers"
Lucene caches rather well and compresses well enough that the IO cache is
effective enough that you are not really paying for disk movement most of
the time. As for the stack of scores, that is actually m
I too tend to ignore the overhead of delegated calls, especially compared
to all other IO ops and computations done by the stack of scorers, but
accepting that you cannot ignore it, could you achieve the same goal by
sub-classing the top query where you subclass its weight to return a
sub-class of
Wouldn't you get it as well with the proposed API?
You would still be able to iterate the docs and at that point call score with
the docid. If you call score() along with iteration, you would still get the
information, no?
Making the scorer take a docid allows you to score any docid in the reader if the
query w
To compute a score you have to see which of your subqueries did not
match, which did, and what are the docfreqs/positions for them.
When iterating, and calling score() only for the current doc, parts of
this data (maybe even all of it, not sure) are already gathered for
you. If you allow calling score(
Shai:
Java cannot inline in this case.
Actually there is an urban legend about using final to hint to the
underlying compiler to inline :) (it turns out to be false, one reason being
dynamic classloading).
Write a simple pgm and try it and see for yourself (remember to turn on
-server on VM optio
What do you mean "we are not inlining"? The compiler inlines methods ... at
least it tries.
Shai
On Tue, Jun 8, 2010 at 8:21 PM, John Wang wrote:
> Shai:
>
> method call overhead in this case is not insignificant because it is in
> a very tight loop, and no, compiler cannot optimize it for y
Hi Earwin:
I am not sure I understand here, e.g. what is the difference between:
float myScoringCode() {
  return computeMyScore(scorer.score());
}
and
float myScoringCode() {
  return computeMyScore(scorer.score(scorer.getDocIdSetIterator().docID()));
}
In the
Shai:
method call overhead in this case is not insignificant because it is in
a very tight loop, and no, the compiler cannot optimize it for you; we are not
inlining because we are in the Java world.
You are right, this breaks backward compatibility. But from 2.4 - 2.9,
we have done MUCH worse.
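[For what it's worth, a naive standalone version of the "simple pgm" John suggests trying above. All names are invented and this is not a rigorous benchmark: there is no real JIT isolation, and once run() has seen both receiver types the call site is no longer monomorphic, which is exactly the situation Earwin describes.]

public class DelegationOverhead {

  interface SimpleScorer { float score(int doc); }

  static class RawScorer implements SimpleScorer {
    public float score(int doc) { return doc * 0.5f; }
  }

  // Adds exactly one extra virtual call per scored doc - the "delegate" hop in question.
  static class DelegatingScorer implements SimpleScorer {
    private final SimpleScorer delegate;
    DelegatingScorer(SimpleScorer delegate) { this.delegate = delegate; }
    public float score(int doc) { return delegate.score(doc); }
  }

  static float run(SimpleScorer s, int docs) {
    float sum = 0f;
    for (int doc = 0; doc < docs; doc++) { sum += s.score(doc); }
    return sum;
  }

  public static void main(String[] args) {
    final int docs = 20000000;
    SimpleScorer raw = new RawScorer();
    SimpleScorer wrapped = new DelegatingScorer(raw);
    run(raw, docs); run(wrapped, docs);          // crude warm-up of both paths
    long t0 = System.nanoTime(); float a = run(raw, docs);
    long t1 = System.nanoTime(); float b = run(wrapped, docs);
    long t2 = System.nanoTime();
    System.out.println("direct:    " + ((t1 - t0) / 1000000) + " ms (sum=" + a + ")");
    System.out.println("delegated: " + ((t2 - t1) / 1000000) + " ms (sum=" + b + ")");
  }
}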
Yeah I got what he meant, but I honestly don't think those delegate calls
are an overhead ...
Shai
On Tue, Jun 8, 2010 at 8:12 PM, Earwin Burrfoot wrote:
> Shai, his wrapper Scorer will just look like:
> DISI getDISI() {
> return delegate.getDISI();
> }
>
> float score(int doc) {
> return cal
Shai, his wrapper Scorer will just look like:
DISI getDISI() {
  return delegate.getDISI();
}
float score(int doc) {
  return calcMyAwesomeScore(doc);
}
This saves the delegate.nextDoc() / delegate.advance() indirection calls.
But I already offered a better alternative :)
On Tue, Jun 8, 2010 at 21:09
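[Spelled out slightly more, the wrapper above might look like this under the proposed API (getDISI() standing in for getDocIDSetIterator()). WrapperScorer and calcMyAwesomeScore are placeholders from the thread, not existing Lucene classes, and whether score() consults the delegate's raw score is up to the use case:]

import org.apache.lucene.search.DocIdSetIterator;

// Sketch only: assumes the proposed Scorer shape, getDocIDSetIterator() + score(int docid).
public class WrapperScorer extends Scorer {
  private final Scorer delegate;

  public WrapperScorer(Scorer delegate) { this.delegate = delegate; }

  // Iteration is not wrapped at all: callers get the delegate's iterator
  // directly, so nextDoc()/advance() pay no extra indirection.
  public DocIdSetIterator getDocIDSetIterator() {
    return delegate.getDocIDSetIterator();
  }

  public float score(int docid) {
    // e.g. decorate the delegate's raw score (age decay, custom boosts, ...)
    return calcMyAwesomeScore(docid, delegate.score(docid));
  }

  private float calcMyAwesomeScore(int docid, float rawScore) {
    return rawScore; // placeholder for the custom formula
  }
}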
The problem with your proposal is that, currently, Lucene uses the current
iteration state to compute the score.
I.e. it already knows which of the BQ's SHOULD clauses matched for the current
doc, so it's easier to calculate the score.
If you change the API to allow scoring arbitrary documents (even those
that didn't ma
I guess I must be missing something fundamental here :).
If Scorer is defined as you propose, and I create my Scorer which impls
getDISI() as "return this" - what do I lose? What's wrong w/ Scorer already
being a DISI?
You mention "it is just inefficient to pay the method call overhead ..." -
wha
re: But Scorer is itself an iterator, so what prevents you from calling
nextDoc and advance on it without score()
Nothing. It is just inefficient to pay the method call overhead just to
overload score.
re: If I were in your shoes, I'd simply provider a Query wrapper. If CSQ
is not good enough I'd
Well … I don't know the reason either, and I've always thought Scorer and
Similarity are confusing.
But Scorer is itself an iterator, so what prevents you from calling
nextDoc and advance on it without score(). And what would the returned
DISI do when nextDoc is called, if not delegate to its subs?
If
Hi Shai:
I am not sure I understand how changing Similarity would solve this
problem - wouldn't you need the reader?
As for PayloadTermQuery, payload is not always the most efficient way of
storing such data, especially when number of terms << numdocs. (I am not
sure accessing the payload
So wouldn't it make sense to add some method to Similarity? Which receives
the doc Id in question maybe ... just thinking here.
Factoring Scorer like you propose would create 3 objects for
scoring/iterating: Scorer (which really becomes an iterator), Similarity and
CustomScoreFunction ...
Maybe y
Hi Shai:
Similarity in many cases is not sufficient for scoring. For example, to
implement age decay of a document (very useful for corpuses like news or
tweets), you want to project the raw tfidf score onto a time curve, say
f(x). To do this, you'd have a custom scorer that decorates the u
I'm not sure I understand what you mean - Scorer is a DISI itself, and the
scoring formula is mostly controlled by Similarity.
What will be the benefits of the proposed change?
Shai
On Tue, Jun 8, 2010 at 8:25 AM, John Wang wrote:
> Hi guys:
>
> I'd like to make a proposal to change the Sc
Hi guys:
I'd like to make a proposal to change the Scorer class/api to the
following:
public abstract class Scorer {
  public abstract DocIdSetIterator getDocIDSetIterator();
  public abstract float score(int docid);
}
Reasons:
1) To build a Scorer from an existing Scorer (e.g. that produces raw scores
from tfidf), one
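[To make the intended usage concrete, a sketch of how a caller might drive the proposed Scorer; everything below assumes the class shape proposed above and is illustrative only.]

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public class ProposedScorerUsage {
  public static void collect(Scorer scorer) throws IOException {
    DocIdSetIterator iter = scorer.getDocIDSetIterator();
    int doc;
    while ((doc = iter.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      float s = scorer.score(doc);   // score the doc we are positioned on ...
      System.out.println(doc + " -> " + s);
      // ... and, in principle, any other docid too, e.g. scorer.score(someOtherDoc),
      // which is the "arbitrary doc scoring" can of worms discussed earlier in the thread.
    }
  }
}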