Doug Cutting wrote:
Andrzej Bialecki wrote:
I tested it on a 5 million document index.
Thanks, this is great data!
Can you please tell a bit more about the experiments? In particular:
. How were scores assigned to pages? Link analysis? log(number of
incoming links) or OPIC?
log()
. How were
Andrzej Bialecki wrote:
. How were the queries generated? From a log or randomly?
Queries were picked manually from a real query log, to test the
worst-performing cases.
So, for example, the 50% error rate might not be typical, but could be
worst-case.
. When results differed
Doug Cutting wrote:
Andrzej Bialecki wrote:
Ok, I just tested IndexSorter for now. It appears to work correctly,
at least I get exactly the same results, with the same scores and the
same explanations, if I run the same queries on the original and on
the sorted index.
Here's a more
Andrzej Bialecki wrote:
I'll test it soon - one comment, though. Currently you use a subclass of
RuntimeException to stop the collecting. I think we should come up with
a better mechanism - throwing exceptions is too costly.
I thought about this, but I could not see a simple way to achieve
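
For reference, here is a minimal sketch of the exception-based stop being discussed. It is not the attached code: Collector, FirstN, StopCollecting and scan are hypothetical names, and the callback only loosely mirrors a hit collector. The one thing it shows is that most of the cost of a throw is capturing the stack trace, which a preallocated exception with an overridden fillInStackTrace() avoids.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitedCollectorSketch {

    /** Hypothetical stand-in for a hit-collecting callback; not Lucene's actual API. */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Thrown to abort collection. A single shared instance that never captures a stack trace. */
    static final class StopCollecting extends RuntimeException {
        static final StopCollecting INSTANCE = new StopCollecting();

        private StopCollecting() {
            super("collection limit reached");
        }

        @Override
        public synchronized Throwable fillInStackTrace() {
            return this; // skip the expensive stack walk
        }
    }

    /** Keeps the first 'limit' documents it is handed, then signals the caller to stop. */
    static final class FirstN implements Collector {
        final int limit;
        final List<Integer> docs = new ArrayList<>();

        FirstN(int limit) {
            this.limit = limit;
        }

        @Override
        public void collect(int doc, float score) {
            if (docs.size() >= limit) {
                throw StopCollecting.INSTANCE;
            }
            docs.add(doc);
        }
    }

    /** Hypothetical scan loop: returns how many documents the collector accepted. */
    static int scan(int[] postings, Collector collector) {
        int accepted = 0;
        try {
            for (int doc : postings) {
                collector.collect(doc, 1.0f);
                accepted++;
            }
        } catch (StopCollecting stop) {
            // Expected: the collector has everything it needs.
        }
        return accepted;
    }
}
```

Whether this is cheap enough is something only a measurement on a real index can settle; a callback that returns a boolean would avoid the throw entirely, but would need a change to the collecting interface.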
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting list...
Yes. I was just posting the work-in-progress.
We will also need to estimate the total number of matches by
extrapolating
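
The thread does not say how the extrapolation would be done. One possible reading, sketched below, is to stop after the first n matches and scale the count by the fraction of the index scanned so far. The names (estimateTotal, matchingDocs, maxDoc) are hypothetical, and the estimate assumes matches are spread roughly evenly over document ids, which may not hold for an index sorted by boost.

```java
public class TotalHitsEstimateSketch {

    /**
     * Collects at most n matches and extrapolates the total match count.
     *
     * @param matchingDocs ascending ids of matching documents (hypothetical posting list)
     * @param maxDoc       total number of documents in the index
     * @param n            number of matches to collect before stopping
     * @return the exact count if the list was exhausted before n matches, otherwise an estimate
     */
    static long estimateTotal(int[] matchingDocs, int maxDoc, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int collected = 0;
        int lastDoc = -1;
        for (int doc : matchingDocs) {
            if (collected == n) {
                break; // stop early: the rest of the posting list is never read
            }
            collected++;
            lastDoc = doc;
        }
        if (collected < n) {
            return collected; // whole list was scanned, so the count is exact
        }
        // Documents 0..lastDoc were scanned, i.e. (lastDoc + 1) out of maxDoc.
        return Math.round((double) collected * maxDoc / (lastDoc + 1));
    }
}
```

With every second document matching, the first 10 matches cover ids 0 through 19 of a 1000-document index, so the sketch scales 10 matches by 1000/20.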
Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my knowledge
of Lucene... ;-)
Okay, I'll try to get something working fairly soon.
Doug
Andrzej Bialecki wrote:
By all means please start, this is still near the limits of my knowledge
of Lucene... ;-)
Attached is a class which sorts a Nutch index by boost. I have only
tested it on a ~100 page index, where it appears to work correctly.
Please tell me how it works for you.
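
The attachment itself is not reproduced in this listing. As a rough illustration of the idea only, sorting by boost amounts to computing a permutation of document ids with the highest boost first, and then renumbering documents through it. The sketch below is not the attached class; newToOld and oldToNew are hypothetical helper names, and boosts are given as a plain array rather than read from an index.

```java
import java.util.Arrays;

public class BoostSortSketch {

    /** Returns a map where position i holds the old doc id that becomes doc i after sorting. */
    static int[] newToOld(float[] boosts) {
        Integer[] ids = new Integer[boosts.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on objects is stable, so equal boosts keep their original order.
        Arrays.sort(ids, (a, b) -> Float.compare(boosts[b], boosts[a]));
        int[] result = new int[ids.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    /** Inverts the permutation: position i holds the new doc id of old doc i. */
    static int[] oldToNew(int[] newToOld) {
        int[] result = new int[newToOld.length];
        for (int newId = 0; newId < newToOld.length; newId++) {
            result[newToOld[newId]] = newId;
        }
        return result;
    }
}
```

Once documents are renumbered this way, the highest-boost pages get the lowest ids, so they come first in every posting list.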