I've got a 400 million db I can run this against over the next few days.

-byron
--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> Hi Andrzej,
>
> wow, that is really great news!
>
> > Using the optimized index, I reported previously that some of the
> > top-scoring results were missing. As it happens, the missing
> > results were typically the "junk" pages with high tf/idf but low
> > "boost". Since we collect up to N hits, going from higher to lower
> > "boost" values, the "junk" pages with low "boost" value were
> > automatically eliminated. So, overall the subjective quality of
> > results was improved. On the other hand, some of the legitimate
> > results with decent "boost" values were also skipped because they
> > didn't fit within the fixed number of hits... ah, well. Perhaps we
> > should limit the number of hits in LimitedCollector using a cutoff
> > "boost" value, and not the maximum number of hits (or maybe both?).
>
> As long as we are experimenting, it would be good to have both.
>
> > To conclude, I will add the IndexSorter.java to the core classes,
> > and I suggest we continue the experiments ...
>
> Maybe someone out there in the community has a commercial search engine
> running (e.g. a Google appliance or similar), so we can set up a
> Nutch with the same pages and compare the results.
> I guess it will be difficult to compare Nutch with Yahoo or Google,
> since none of us has a 4 billion index up and running. I would run
> one on my laptop, but I do not have the bandwidth to fetch for the next
> two days. :-D
> Great work!
>
> Cheers,
> Stefan
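
For reference while experimenting, here is a minimal sketch of the "cutoff boost, max hits, or both" idea discussed in the quoted message. It is not the actual LimitedCollector code and does not use the Lucene or Nutch APIs: the names `Hit`, `BoostCutoffCollector`, `maxHits` and `minBoost` are hypothetical, and it assumes hits are offered in descending "boost" order, as they would be with a boost-sorted index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hit record: a document id and its "boost" value. */
final class Hit {
    final int doc;
    final float boost;

    Hit(int doc, float boost) {
        this.doc = doc;
        this.boost = boost;
    }
}

/**
 * Sketch of a collector that stops on a hit-count limit, a boost cutoff,
 * or both. Assumption: hits arrive in descending boost order, so the first
 * hit below the cutoff means every later hit is below it too.
 */
final class BoostCutoffCollector {
    private final int maxHits;     // upper bound on collected hits
    private final float minBoost;  // hits with boost below this are rejected
    private final List<Hit> hits = new ArrayList<>();
    private boolean done = false;

    BoostCutoffCollector(int maxHits, float minBoost) {
        this.maxHits = maxHits;
        this.minBoost = minBoost;
    }

    /** Offers one hit; returns false once collection should stop. */
    boolean collect(int doc, float boost) {
        if (done) {
            return false;
        }
        if (boost < minBoost || hits.size() >= maxHits) {
            done = true;  // either limit ends the collection
            return false;
        }
        hits.add(new Hit(doc, boost));
        return true;
    }

    List<Hit> hits() {
        return hits;
    }
}
```

In this sketch, passing `Integer.MAX_VALUE` as `maxHits` leaves only the boost cutoff active, and passing `Float.NEGATIVE_INFINITY` as `minBoost` leaves only the hit-count limit active, so the three variants can be compared with the same class.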