Re: IndexSorter optimizer
Byron Miller wrote:
> On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the top N% (since they only show the first 1000 or so results anyway)?

Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes, the entire dataset is searched. In general there can be multiple levels, e.g. high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category and the vast majority of queries resolved by consulting only the first category.

What I have implemented so far for Nutch is a single-index version of this. The current index-sorting implementation does not yet scale well to indexes larger than ~50M urls; it is a proof-of-concept. A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.

Doug
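The tiered fallback Doug describes can be sketched in plain Java. This is a toy model, not Nutch's actual search code: each "tier" stands in for a separate Lucene index, and the names (`TieredSearch`, `matches`, the threshold parameter) are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class TieredSearch {
    /** One tier = one index; here each tier is simulated as a list of doc texts. */
    static List<String> matches(List<String> docs, String term) {
        List<String> out = new ArrayList<>();
        for (String d : docs) if (d.contains(term)) out.add(d);
        return out;
    }

    /** Search tiers in descending-score order; fall through only while hits are scarce. */
    public static List<String> search(List<List<String>> tiers, String term, int threshold) {
        List<String> hits = new ArrayList<>();
        for (List<String> tier : tiers) {
            hits.addAll(matches(tier, term));
            if (hits.size() >= threshold) break; // the high tiers satisfied the query
        }
        return hits;
    }
}
```

The payoff is that most queries never touch the large low-score tiers, which is exactly why the vast majority of queries can be resolved against a small fraction of the data.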
Re: IndexSorter optimizer
Doug Cutting wrote:
> Byron Miller wrote:
>> On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the top N% (since they only show the first 1000 or so results anyway)?
>
> Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes, the entire dataset is searched. In general there can be multiple levels, e.g. high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category and the vast majority of queries resolved by consulting only the first category.
>
> What I have implemented so far for Nutch is a single-index version of this. The current index-sorting implementation does not yet scale well to indexes larger than ~50M urls; it is a proof-of-concept. A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.

The WWW2005 conference presented a couple of interesting papers on the subject (http://www2005.org), among others these:

1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf

The techniques described in the first paper are not too difficult to implement, especially Carmel's method of index pruning, which gives satisfactory results at moderate cost. The third paper, by Long and Suel, presents a concept of using a cache of intersections for multi-term queries, which we already sort of use with CachingFilters, only they propose to store them on disk instead of limiting the cache to a relatively small number of filters kept in RAM...

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: IndexSorter optimizer
On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the top N% (since they only show the first 1000 or so results anyway)?

With this patch and a top-result limit set in the xml file, does that mean it will stop scanning the index at that point? Is there a methodology to actually prune the index on some scaling factor, so that a 4-billion-page index can be searchable only 1k results deep on average?

Seems like some sort of method to do the above would cut your search processing/index size down fairly well. But it may be more expensive to post-process to this scale than it is to simply push everything and let the query optimizer ignore it as needed... after all, disk space is getting rather cheap compared to CPU and memory.

--- Doug Cutting [EMAIL PROTECTED] wrote:
> Andrzej Bialecki wrote:
>> I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.
>
> Great news! I will submit the Lucene patches ASAP, now that we know they're useful.
>
> Doug
Re: IndexSorter optimizer
Andrzej Bialecki wrote:
> I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

Great news! I will submit the Lucene patches ASAP, now that we know they're useful.

Doug
Re: IndexSorter optimizer
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> Using the original index, it was possible for pages with a high tf/idf of a term, but with a low boost value (the OPIC score), to outrank pages with a high boost but lower tf/idf of the term. This phenomenon quite often leads to results that are perceived as junk, e.g. pages with a lot of repeated terms but little other real content, like navigation bars.
>
> Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?

I don't think it's that simple; the OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but to a human judgement.

>> To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments...
>
> I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java or would you?

Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which for indexes with deleted docs (after dedup) leads to exceptions.

--
Best regards,
Andrzej Bialecki
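The numDocs()/maxDoc() distinction behind those exceptions: after deletions, Lucene doc ids are not dense, so numDocs() (the live-document count) undercounts the id space, while maxDoc() bounds it. A toy model of the id space (not Lucene's API; the class and field names are invented for illustration) shows why a loop bounded by numDocs() goes wrong:

```java
import java.util.ArrayList;
import java.util.List;

public class DocIdSpace {
    final boolean[] deleted;   // one slot per doc id, like a Lucene segment

    DocIdSpace(int maxDoc, int... deletedIds) {
        deleted = new boolean[maxDoc];
        for (int id : deletedIds) deleted[id] = true;
    }

    int maxDoc()  { return deleted.length; }   // total id slots, including deleted
    int numDocs() { int n = 0; for (boolean d : deleted) if (!d) n++; return n; }

    /** Correct scan: iterate the full id space and skip deletions. */
    List<Integer> liveDocs() {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < maxDoc(); i++) if (!deleted[i]) out.add(i);
        return out;
    }
}
```

With maxDoc() == 5 and doc 1 deleted, numDocs() is 4 but live ids run up to 4: a loop bounded by numDocs() never visits doc 4, and sizing an array by numDocs() while indexing it with raw doc ids overflows, which matches the exceptions seen after dedup.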
Re: IndexSorter optimizer
Doug Cutting wrote:
> I have committed this, along with the LuceneQueryOptimizer changes. I could only find one place where I was using numDocs() instead of maxDoc().

Right, I confused two bugs from different files. The other bug still exists in the committed version of the LuceneQueryOptimizer.LimitedCollector constructor: instead of super(maxHits) it should be super(numHits). This was actually the bug that was causing that mysterious slowdown for higher values of MAX_HITS.

--
Best regards,
Andrzej Bialecki
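The effect of that one-argument slip can be seen in a stripped-down collector. This is a hypothetical sketch mirroring the idea of LimitedCollector, not the actual Nutch class: the capacity passed up to the super-class must be the per-query numHits, and passing the global MAX_HITS ceiling instead makes every query collect (and pay for) the worst case.

```java
public class LimitedCollector {
    private final int capacity;   // how many hits we actually keep
    private int collected;

    LimitedCollector(int capacity) { this.capacity = capacity; }

    /** Returns true while the hit is kept. */
    boolean collect(int doc) {
        if (collected >= capacity) return false;
        collected++;
        return true;
    }

    int size() { return collected; }

    /** The fixed construction: capacity from numHits, not the MAX_HITS ceiling. */
    static LimitedCollector forQuery(int numHits, int maxHits) {
        return new LimitedCollector(numHits);   // i.e. super(numHits), not super(maxHits)
    }
}
```

With the buggy super(maxHits), a query asking for 10 hits would keep collecting up to MAX_HITS documents, which is consistent with the slowdown growing as MAX_HITS is raised.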
IndexSorter optimizer
Hi,

I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

The reason why result quality seems better is quite interesting, and it shows that the simple top-N measures I was using in my benchmarks may have been too simplistic. Using the original index, it was possible for pages with a high tf/idf of a term, but with a low boost value (the OPIC score), to outrank pages with a high boost but lower tf/idf of the term. This phenomenon quite often leads to results that are perceived as junk, e.g. pages with a lot of repeated terms but little other real content, like navigation bars.

Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the junk pages with high tf/idf but low boost. Since we collect up to N hits, going from higher to lower boost values, the junk pages with low boost values were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some legitimate results with decent boost values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff boost value, and not the maximum number of hits (or maybe both?).

This again brings to attention the importance of the OPIC score: it represents a query-independent opinion about the quality of the page, whichever way you calculate it. If you use PageRank, it (allegedly) corresponds to other people's opinions about the page, thus providing an objective quality opinion. If you use a simple list of white/black-listed sites that you like/dislike, then it represents your own subjective opinion on the quality of the site; etc., etc...

In this way, running a search engine that provides good results is not just a matter of plain precision, recall, tf/idf and other tangible measures; it's also a sort of political statement by the engine's operator. ;-)

To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments...

--
Best regards,
Andrzej Bialecki
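Because the sorted index yields hits in descending boost order, the two stopping rules suggested above (a count limit and a boost cutoff) can both be expressed as early termination. A sketch of the combined rule, assuming the "maybe both" variant; the class and parameter names are invented, and hits are represented simply by their boost values:

```java
import java.util.ArrayList;
import java.util.List;

public class BoostCutoffCollector {
    /**
     * Collect from hits already sorted by descending boost, stopping at
     * whichever limit triggers first: maxHits or the minimum-boost cutoff.
     */
    public static List<Float> collect(List<Float> boostsDesc, int maxHits, float minBoost) {
        List<Float> kept = new ArrayList<>();
        for (float boost : boostsDesc) {
            if (boost < minBoost) break;       // cutoff: everything after is lower still
            if (kept.size() >= maxHits) break; // count limit reached
            kept.add(boost);
        }
        return kept;
    }
}
```

The boost cutoff would keep legitimate medium-boost results that a hard count limit drops, at the price of an unbounded hit count for queries dominated by high-boost pages.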
Re: IndexSorter optimizer
Hi Andrzej,

wow, this is really great news!

> Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the junk pages with high tf/idf but low boost. Since we collect up to N hits, going from higher to lower boost values, the junk pages with low boost values were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some legitimate results with decent boost values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff boost value, and not the maximum number of hits (or maybe both?).

As long as we are experimenting, it would be good to have both.

> To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments...

Maybe someone out there in the community has a commercial search engine running (e.g. a Google appliance or similar), so we could set up a Nutch instance with the same pages and compare the results. I guess it will be difficult to compare Nutch with Yahoo or Google, since none of us has a 4-billion-page index up and running. I would run one on my laptop, but I won't have the bandwidth to fetch for the next two days. :-D

Great work!

Cheers,
Stefan
Re: IndexSorter optimizer
I've got a 400-million-page db I can run this against over the next few days.

-byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:
> Hi Andrzej,
>
> wow, this is really great news!
>
>> Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the junk pages with high tf/idf but low boost. Since we collect up to N hits, going from higher to lower boost values, the junk pages with low boost values were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some legitimate results with decent boost values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff boost value, and not the maximum number of hits (or maybe both?).
>
> As long as we are experimenting, it would be good to have both.
>
>> To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments...
>
> Maybe someone out there in the community has a commercial search engine running (e.g. a Google appliance or similar), so we could set up a Nutch instance with the same pages and compare the results. I guess it will be difficult to compare Nutch with Yahoo or Google, since none of us has a 4-billion-page index up and running. I would run one on my laptop, but I won't have the bandwidth to fetch for the next two days. :-D
>
> Great work!
>
> Cheers,
> Stefan
Re: IndexSorter optimizer
Andrzej Bialecki wrote:
> Hi,
>
> I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

This is very interesting. What's the computational complexity and disk I/O for index sorting as compared to other operations on an index (e.g. adding/deleting N documents and running optimize)?
Re: IndexSorter optimizer
Jeff Bowden wrote:
> Andrzej Bialecki wrote:
>> Hi,
>>
>> I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.
>
> This is very interesting. What's the computational complexity and disk I/O for index sorting as compared to other operations on an index (e.g. adding/deleting N documents and running optimize)?

Comparable to optimize(). All index data needs to be read and copied, so the whole process is I/O bound.

--
Best regards,
Andrzej Bialecki
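The core of such a sort pass is just a permutation of doc ids. A sketch (not the committed IndexSorter; the method name and representation are assumptions): order doc ids by descending score, then copy every posting and stored field through that new-to-old mapping, which is why the cost is one full read-and-rewrite of the index, much like optimize().

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortPermutation {
    /** newToOld[n] = the old doc id that should live at new position n. */
    public static int[] newToOld(float[] scoreByOldDoc) {
        Integer[] order = new Integer[scoreByOldDoc.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Highest score first; ties keep their original order (stable sort).
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scoreByOldDoc[i]));
        int[] map = new int[order.length];
        for (int n = 0; n < order.length; n++) map[n] = order[n];
        return map;
    }
}
```

Computing the permutation itself is O(maxDoc log maxDoc) in memory; the rewrite of all index data through the mapping is what dominates and makes the whole operation I/O bound.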