Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: I tested it on a 5 mln index. Thanks, this is great data! Can you please tell a bit more about the experiments? In particular: . How were scores assigned to pages? Link analysis? log(number of incoming links) or OPIC? log() . How were

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Doug Cutting
Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case. . When results differed

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting
Andrzej Bialecki wrote: I'll test it soon - one comment, though. Currently you use a subclass of RuntimeException to stop the collecting. I think we should come up with a better mechanism - throwing exceptions is too costly. I thought about this, but I could not see a simple way to achieve

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Doug Cutting
Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. We will also need to estimate the total number of matches by extrapolating

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. Ok, I just tested IndexSorter for now. It appears

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Okay, I'll try to get something working fairly soon. Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Doug Cutting
Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Attached is a class which sorts a Nutch index by boost. I have only tested it on a ~100 page index, where it appears to work correctly. Please tell me how it works for you.

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Attached is a class which sorts a Nutch index by boost. I have only tested it on a ~100 page index, where it appears to work correctly. Please tell me