Hi James,
A paper was mentioned on this list in the last couple of months which
presents a solution to your sampling problem without having to know the
total results size in advance. The paper
(http://www2005.org/cdrom/docs/p245.pdf) presents two solutions which
utilize a random variable. One solution has you traverse the result set
and select each document with probability p, where p is fixed in advance.
Alternatively, the paper describes an algorithm (bottom of page 248) for
determining a skip value which, like the traversal, visits the results in
order but lets you jump over documents, saving the per-document
probability computation that the first solution requires.
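For illustration, the two ideas can be sketched roughly as below. This is not the paper's algorithm verbatim: the skip computation is the standard geometric-skip technique, which gives the same distribution as a per-document coin flip. The class and method names (StreamSampler, sampleByCoinFlip, nextSkip, sampleBySkipping) are made up for this sketch, and documents are stood in for by the integers 0..numDocs-1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: sample a stream of unknown length with fixed probability p. */
class StreamSampler {

    /** First approach: one random draw per document. */
    static List<Integer> sampleByCoinFlip(int numDocs, double p, Random rng) {
        List<Integer> picked = new ArrayList<>();
        for (int doc = 0; doc < numDocs; doc++) {
            if (rng.nextDouble() < p) {
                picked.add(doc);
            }
        }
        return picked;
    }

    /** Number of documents to pass over before the next selected one (geometric). */
    static long nextSkip(double p, Random rng) {
        if (p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1]");
        }
        if (p == 1.0) {
            return 0;
        }
        double u = 1.0 - rng.nextDouble();            // in (0, 1], avoids log(0)
        double skip = Math.floor(Math.log(u) / Math.log1p(-p));
        return (long) Math.min(skip, (double) Integer.MAX_VALUE);
    }

    /** Second approach: one random draw per *selected* document. */
    static List<Integer> sampleBySkipping(int numDocs, double p, Random rng) {
        List<Integer> picked = new ArrayList<>();
        long doc = nextSkip(p, rng);
        while (doc < numDocs) {
            picked.add((int) doc);
            doc += 1 + nextSkip(p, rng);
        }
        return picked;
    }
}
```

Inside a HitCollector the skipping variant would amount to keeping a countdown: decrement it on each collect call, and only when it reaches zero load the document and draw the next skip.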
I hope this helps!
Tricia
On Thu, 6 Jul 2006, James Pine wrote:
Hey,
Sorry, I will explain a bit more about my collect
method. Currently my collect method is executing
IndexSearcher.doc(id) and storing some stuff in a Map
which I can then retrieve from the HitCollector (much
like the example in the Lucene In Action book). Of
course that's somewhat expensive, so I'd like to do
some statistical sampling based on the result set size
to try and speed things up.
The way I was thinking about doing this was, during
the collect method only executing
IndexSearcher.doc(id) on every Nth document, where N
is calculated dynamically based on a minimum number X.
The rule would be:
N = Max(1,(numResults / X))
In order to do this in the collect method, I need to
know the total number of results before ever invoking
the collect method, right? That seemed to make a case
for the BitSet/QueryFilter in the constructor.
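As a rough sketch of the rule above, assuming numResults is already known when the collector is built: the helper below (EveryNthSampler is a made-up name, not a Lucene class) computes N once and then answers, hit by hit, whether the expensive document load should happen.

```java
/** Hypothetical helper: picks every Nth hit, where N = max(1, numResults / minSamples). */
class EveryNthSampler {
    private final int stride;   // N in the rule above
    private int seen = 0;       // hits observed so far

    EveryNthSampler(int numResults, int minSamples) {
        if (minSamples < 1) {
            throw new IllegalArgumentException("minSamples must be >= 1");
        }
        // Integer division rounds down, so at least minSamples hits are
        // picked whenever numResults >= minSamples.
        this.stride = Math.max(1, numResults / minSamples);
    }

    int stride() {
        return stride;
    }

    /** Call once per hit, e.g. from collect(); true means "load this document". */
    boolean shouldLoad() {
        return (seen++ % stride) == 0;
    }
}
```

With numResults = 1000 and X = 100 this gives N = 10 and loads 100 documents; with numResults = 50 it gives N = 1 and loads all 50.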
In addition, someone else on the list mentioned that
one of the problems with calling IndexSearcher.doc(id)
in the collect method is that it causes the disk to do
a lot of seeking. Maybe that's a moot point if one is
using a RAMDirectory or an FSDirectory small enough
that it gets cached by the OS anyway, but if it's not,
then I thought it might be more performant to have the
HitCollector set the bits in the collect method and
then do another pass to do the statistical sampling.
Either way it seems that to do the statistical
sampling that I envision I either need to calculate
the total result count/document id set in the
constructor, before calling the collect method, or
calculate the total result count/document id set in
the collect method and then execute some sort of
post-collect method, right? So I was just wondering
which method was better/faster. Thanx.
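The second option (record ids during collect, sample afterwards) might look roughly like the sketch below. It uses only java.util.BitSet; TwoPassSampler and its method names are invented for illustration, and in real use the collect(int) body would sit inside a HitCollector's collect(doc, score) while the document load would replace picked.add(doc).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch: remember matching ids first, sample them in a second pass. */
class TwoPassSampler {
    private final BitSet matches = new BitSet();

    /** First pass: cheap, no document loading, just mark the id. */
    void collect(int doc) {
        matches.set(doc);
    }

    /** Second pass: walk matching ids in increasing order, keep every Nth. */
    List<Integer> sample(int minSamples) {
        if (minSamples < 1) {
            throw new IllegalArgumentException("minSamples must be >= 1");
        }
        int numResults = matches.cardinality();
        int stride = Math.max(1, numResults / minSamples);
        List<Integer> picked = new ArrayList<>();
        int position = 0;
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            if (position++ % stride == 0) {
                picked.add(doc);   // the expensive load would go here
            }
        }
        return picked;
    }
}
```

The query itself runs only once this way, and the sampled ids come out in increasing order. The cost is the BitSet, roughly one bit per document id up to the highest match.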
JAMES
--- Chris Hostetter <[EMAIL PROTECTED]> wrote:
: I'm using a HitCollector and would like to know the
: total number of results that matched a given query.
: Based on the JavaDoc, I think this will do the trick:
you don't need a BitSet in that case, you could find that out just using
an int...
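import org.apache.lucene.search.HitCollector;

// Counts every hit passed to collect(); nothing else is stored.
public class CountingCollector extends HitCollector {
  public int count = 0;
  public void collect(int doc, float score) {
    count++;
  }
}

CountingCollector c = new CountingCollector();
searcher.search(query, c);
int numResults = c.count;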
: If I want to know the total number of results inside
: of the HitCollector, i.e. before the collect method
: has ever been called, I think I could pass the Query
: and Searcher objects into the HitCollector and do this
: in its constructor:
:
: BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
: int numResults = bits.cardinality();
This question doesn't make a lot of sense to me, why do you need to know
the total number of results before the collect method is called? .. what
you are suggesting here (using QueryFilter in this way) is perfectly
legal, but it's going to do just as much work as using a HitCollector will
(possibly more, i can't remember).
: Is Lucene executing another pass over the index in
: order to populate the BitSet and then doing another
: pass while calling the collect method? Thanx.
in your last example, you never use your HitCollector, so i'm not sure what
you mean, but assuming you are asking about combining those examples into
something like this....
IndexSearcher searcher = new IndexSearcher(indexReader);
BitSet bits = (new QueryFilter(query)).bits(searcher.getIndexReader());
final int numResults = bits.cardinality();
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    /* do something with numResults and doc and score */
  }
});
...then yes, you are most definitely making two passes to do that.
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]