RE: Handling large scale incremental PageRank updates
Hello - please see inline. M.

-----Original message-----
> From: Otis Gospodnetić
> Sent: Friday 15th January 2016 22:05
> To: Nutch User List
> Subject: Handling large scale incremental PageRank updates
>
> Hello,
>
> We are working on a very large scale crawl (many billions of web pages)
> that needs to make use of link/page rank. Because page rank for a page P
> changes as more links to page P are discovered, one really ought to
> periodically update the rank of the previously indexed page P.

I don't think changing rank is going to be the big problem. Only at first will the graph change quickly; the scores are quite stable for long-term crawls and recrawls. Also, if you intend to calculate LinkRank frequently, you are going to need lots of hardware: it is CPU intensive and needs several runs for very large crawls.

> This is not a problem for small crawls, but for large ones this is a
> problem if one tries to just reindex previously existing pages - reindexing
> is not cheap and if you've indexed hundreds of millions or billions of
> pages, reindexing them will take a long time and require a lot of resources.

Yes, but if you plan for large scale, your search engine is going to be large scale too, right? And are you not going to recrawl periodically?

> How do people normally handle that with Solr or Elasticsearch at large
> scale?
>
> With Solr, do people stick the rank in the External File Field, for example?

Yes, you can do that. It is very efficient, but you must take care of the sharding yourself: Solr won't take a big file and send it hashed to the various shards.

> With Elasticsearch, do people store pageID => pageRank info in an external
> store (e.g. Redis) and pull it from there to use when scoring search
> results? Or maybe that, too, would be too slow when the number of matches
> is high? Elasticsearch rescore to the rescue?

That should not be a problem. Solr can also do query reranking.
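Doing the sharding yourself for an external rank file could look roughly like the sketch below: split one big "url=rank" file into per-shard files using a stable hash. The function names and the MD5-mod-N routing here are illustrative assumptions, not Solr's actual document router; you would have to match whatever routing your collection really uses.

```python
# Sketch: split a single "url=rank" external file into per-shard files,
# since Solr won't distribute one big file across shards for you.
# shard_of/split_external_file and the MD5 routing are assumptions.
import hashlib

def shard_of(url, n_shards):
    """Stable hash routing: the same URL always maps to the same shard."""
    h = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16)
    return h % n_shards

def split_external_file(ranks, n_shards):
    """Return per-shard lists of 'url=rank' lines for the external files."""
    files = [[] for _ in range(n_shards)]
    for url, rank in ranks.items():
        files[shard_of(url, n_shards)].append(f"{url}={rank}")
    return files
```

Each resulting list would be written out as that shard's external rank file, so every shard only ever sees ranks for its own documents.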
If you can request a batch of URL scores via a single call, it should be quite efficient, and that is the approach I would begin with.

> Or are there better, more scalable ways to handle this?
>
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
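The batched-lookup reranking idea could be sketched as follows. A plain dict stands in for Redis (or any external store), and the function names and the 0.5 blend weight are illustrative assumptions, not any library's API:

```python
# Sketch: rerank only the top-N hits by blending the text-match score
# with an externally stored page rank. The dict stands in for Redis;
# fetch_ranks/rerank and the weight are illustrative assumptions.

def fetch_ranks(store, urls, default=0.0):
    """Batch lookup: one call returns ranks for every candidate URL."""
    return {u: store.get(u, default) for u in urls}

def rerank(hits, store, weight=0.5):
    """hits: list of (url, text_score) tuples for the top results only."""
    ranks = fetch_ranks(store, [u for u, _ in hits])
    rescored = [(u, (1 - weight) * s + weight * ranks[u]) for u, s in hits]
    return sorted(rescored, key=lambda t: t[1], reverse=True)

store = {"http://a.example/": 0.9, "http://b.example/": 0.1}
hits = [("http://b.example/", 0.8), ("http://a.example/", 0.7)]
reranked = rerank(hits, store)
```

Because only the already-matched top hits are rescored, the cost stays bounded no matter how many documents match, which is the same idea behind Elasticsearch's rescore window and Solr's query reranking.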
Re: Handling large scale incremental PageRank updates
When we were doing billion-page crawls a while back, in 2006-2008, we had the following setup.

1. Have a given number of shards to handle the full index; at that time this was 25 million pages per shard for 40 shards, for a total of 1 billion pages.

2. Crawl the pages for 1 shard. Update the WebGraph and LinkRank as described here: https://wiki.apache.org/nutch/NewScoring. Don't use Loops. It was a bad program with a bad algorithm and I never should have put it in. Live and learn.

3. Do the same for shards 2..n, each time updating. Each crawl should get the highest-ranked pages that haven't already been crawled within the recrawl interval.

4. Once you reach the maximum amount you can crawl, reset the crawl intervals for all documents and start over with shard 1, replacing the original shard index with the new one.

With this type of setup you will have possible duplicates, and it is batch, so you don't get the fast updates you might be looking for. It should give you an increasingly better index as crawls continue and more links are added to the WebGraph. Ways to improve this might be:

1. Change the algorithm for when pages get recrawled based on how often they change. This would require determining the change rate.

2. Move fast-changing pages to a separate index and only reindex those after each shard run. This fast index then just becomes another shard.

3. Many realtime or NRT search servers behind a partitioning algorithm. Do a shard crawl, update the WebGraph, and then reindex the top X or Y pages whose links changed most.

This was all done using the Nutch SearchServer back when there was one. Not sure how that setup would translate to a Solr or Elasticsearch setup.

Hope this helps.

Dennis

On 01/15/2016 03:04 PM, Otis Gospodnetić wrote:
> Hello,
>
> We are working on a very large scale crawl (many billions of web pages)
> that needs to make use of link/page rank.
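The rotating shard schedule Dennis describes in steps 1-4 could be sketched roughly like this. The ranks are a toy dict and the function names are hypothetical; in the real setup the selection comes from the WebGraph/LinkRank jobs, not an in-memory table:

```python
# Sketch of the rotating shard schedule: each cycle crawls one shard's
# worth of the highest-ranked pages not yet crawled, updates the graph,
# then moves on; after the last shard, reset and start over.
# Toy ranks and hypothetical function names for illustration only.

def next_batch(ranks, crawled, shard_size):
    """Pick the highest-ranked pages not yet crawled in this cycle."""
    candidates = [(u, r) for u, r in ranks.items() if u not in crawled]
    candidates.sort(key=lambda t: t[1], reverse=True)
    return [u for u, _ in candidates[:shard_size]]

ranks = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.2}
crawled = set()
shards = []
for shard in range(2):              # 2 shards of 2 pages each
    batch = next_batch(ranks, crawled, 2)
    crawled.update(batch)
    shards.append(batch)
    # ... fetch pages, update WebGraph, recompute LinkRank here ...
crawled.clear()                     # step 4: reset and start over
print(shards)                       # [['a', 'c'], ['b', 'd']]
```

Because the rank table keeps improving between shard runs, each batch is picked from fresher scores than the last, which is what makes the index "increasingly better" even though updates are batch.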
Handling large scale incremental PageRank updates
Hello,

We are working on a very large scale crawl (many billions of web pages) that needs to make use of link/page rank. Because page rank for a page P changes as more links to page P are discovered, one really ought to periodically update the rank of the previously indexed page P.

This is not a problem for small crawls, but for large ones this is a problem if one tries to just reindex previously existing pages - reindexing is not cheap and if you've indexed hundreds of millions or billions of pages, reindexing them will take a long time and require a lot of resources.

How do people normally handle that with Solr or Elasticsearch at large scale?

With Solr, do people stick the rank in the External File Field, for example?

With Elasticsearch, do people store pageID => pageRank info in an external store (e.g. Redis) and pull it from there to use when scoring search results? Or maybe that, too, would be too slow when the number of matches is high? Elasticsearch rescore to the rescue?

Or are there better, more scalable ways to handle this?

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/