Hi,

I'm happy to report that further tests on a larger index suggest that the overall impact of the IndexSorter is positive: the performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

The reason why result quality seems better is quite interesting, and it shows that the simple top-N measures that I was using in my benchmarks may have been too simplistic.

Using the original index, it was possible for pages with a high tf/idf for a term, but with a low "boost" value (the OPIC score), to outrank pages with a high "boost" but a lower tf/idf for that term. This quite often leads to results that are perceived as "junk", e.g. pages with a lot of repeated terms but little other real content, such as navigation bars.
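To make the effect concrete, here is a toy sketch. It assumes the final score is simply the product of a tf/idf-like term and the "boost", which is a deliberate simplification of the real scoring; the class name, method name and numbers are made up for illustration.

```java
// Toy model only: real scoring has more factors than this simple product.
class BoostVsTfIdfSketch {

    // Hypothetical combined score: tf/idf-like term weight times page boost.
    static double score(double tfIdf, double boost) {
        return tfIdf * boost;
    }

    public static void main(String[] args) {
        double junk = score(9.0, 0.5); // many repeated terms, low boost
        double good = score(3.0, 1.2); // fewer term matches, high boost
        System.out.println("junk page outranks good page: " + (junk > good));
    }
}
```

With these made-up numbers, the low-boost page still ends up with the higher combined score.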

Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the "junk" pages with high tf/idf but low "boost". Since we collect up to N hits, going from higher to lower "boost" values, the "junk" pages with low "boost" values were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some of the legitimate results with decent "boost" values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff "boost" value instead of the maximum number of hits (or maybe both?).
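The cutoff idea could look roughly like the sketch below. This is not the actual LimitedCollector code: BoostCutoffSketch, Hit, collect, maxHits and minBoost are hypothetical names, and the sketch assumes the matching documents arrive already ordered by decreasing "boost", as they would from a sorted index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the real LimitedCollector.
class BoostCutoffSketch {

    // A matching document with its query-independent boost.
    static final class Hit {
        final String id;
        final float boost;

        Hit(String id, float boost) {
            this.id = id;
            this.boost = boost;
        }
    }

    // Collects hits from a list sorted by decreasing boost, stopping at
    // whichever limit is reached first: maxHits or the minBoost cutoff.
    static List<Hit> collect(List<Hit> sortedByBoostDesc, int maxHits, float minBoost) {
        List<Hit> collected = new ArrayList<>();
        for (Hit hit : sortedByBoostDesc) {
            if (collected.size() >= maxHits) {
                break; // fixed-size limit reached
            }
            if (hit.boost < minBoost) {
                break; // everything after this point has an even lower boost
            }
            collected.add(hit);
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Hit> matches = Arrays.asList(
                new Hit("A", 3.0f), new Hit("B", 2.0f), new Hit("C", 1.5f),
                new Hit("junk1", 0.2f), new Hit("junk2", 0.1f));

        // Count limit only: legitimate page C is skipped along with the junk.
        System.out.println(collect(matches, 2, 0.0f).size());
        // Boost cutoff with a generous count limit: A, B and C are kept.
        System.out.println(collect(matches, 10, 1.0f).size());
    }
}
```

Passing both limits keeps the early-termination benefit of the fixed count while letting the cutoff decide where the "junk" starts.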

This again draws attention to the importance of the OPIC score: it represents a query-independent opinion about the quality of the page - whichever way you calculate it. If you use PageRank, it (allegedly) corresponds to other people's opinions about the page, thus providing an "objective" quality opinion. If you use a simple list of white/black-listed sites that you like/dislike, then it represents your own subjective opinion on the quality of the site; etc, etc... In this way, running a search engine that provides "good" results is not just a matter of precision, recall, tf/idf and other tangible measures, it's also a sort of political statement by the engine's operator. ;-)

To conclude, I will add IndexSorter.java to the core classes, and I suggest continuing the experiments ...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



