RE: Handling large scale incremental PageRank updates
Hello - please see inline. M.

-----Original message-----
> From: Otis Gospodnetić
> Sent: Friday 15th January 2016 22:05
> To: Nutch User List
> Subject: Handling large scale incremental PageRank updates
>
> Hello,
>
> We are working on a very large scale crawl (many billions of web pages)
> that needs to make use of link/page rank. Because page rank for a page P
> changes as more links to page P are discovered, one really ought to
> periodically update the rank of the previously indexed page P.

I don't think changing rank is going to be the big problem. Only at first will the graph change quickly; the scores are quite stable for long-term crawls and recrawls. Also, if you intend to calculate LinkRank frequently, you are going to need lots of hardware: it is CPU intensive and needs several runs for very large crawls.

> This is not a problem for small crawls, but for large ones this is a
> problem if one tries to just reindex previously existing pages - reindexing
> is not cheap and if you've indexed hundreds of millions or billions of
> pages, reindexing them will take a long time and require a lot of resources.

Yes, but if you plan for large scale, your search engine is going to be large scale too, right? And are you not going to recrawl periodically?

> How do people normally handle that with Solr or Elasticsearch at large
> scale?
>
> With Solr, do people stick the rank in the External File Field, for example?

Yes, you can do that. It is very efficient, but you must take care of the sharding yourself: Solr won't take a big file and send it hashed to the various shards.

> With Elasticsearch, do people store pageID => pageRank info in an external
> store (e.g. Redis) and pull it from there to use when scoring search
> results? Or maybe that, too, would be too slow when the number of matches
> is high? Elasticsearch rescore to the rescue?

That should not be a problem. Solr can also do query reranking.
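Doing the sharding yourself for an external rank file could look roughly like the sketch below: split one big "url=rank" file into per-shard files using a stable hash. The function names and the MD5-mod-N routing here are illustrative assumptions, not Solr's actual document router; you would have to match whatever routing your collection really uses.

```python
# Sketch: split a single "url=rank" external file into per-shard files,
# since Solr won't distribute one big file across shards for you.
# shard_of/split_external_file and the MD5 routing are assumptions.
import hashlib

def shard_of(url, n_shards):
    """Stable hash routing: the same URL always maps to the same shard."""
    h = int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16)
    return h % n_shards

def split_external_file(ranks, n_shards):
    """Return per-shard lists of 'url=rank' lines for the external files."""
    files = [[] for _ in range(n_shards)]
    for url, rank in ranks.items():
        files[shard_of(url, n_shards)].append(f"{url}={rank}")
    return files
```

Each resulting list would be written out as that shard's external rank file, so every shard only ever sees ranks for its own documents.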
If you can request a batch of URL scores via a single call, it should be quite efficient, and that is the approach I would begin with.

> Or are there better, more scalable ways to handle this?
>
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
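The batched-lookup reranking idea could be sketched as follows. A plain dict stands in for Redis (or any external store), and the function names and the 0.5 blend weight are illustrative assumptions, not any library's API:

```python
# Sketch: rerank only the top-N hits by blending the text-match score
# with an externally stored page rank. The dict stands in for Redis;
# fetch_ranks/rerank and the weight are illustrative assumptions.

def fetch_ranks(store, urls, default=0.0):
    """Batch lookup: one call returns ranks for every candidate URL."""
    return {u: store.get(u, default) for u in urls}

def rerank(hits, store, weight=0.5):
    """hits: list of (url, text_score) tuples for the top results only."""
    ranks = fetch_ranks(store, [u for u, _ in hits])
    rescored = [(u, (1 - weight) * s + weight * ranks[u]) for u, s in hits]
    return sorted(rescored, key=lambda t: t[1], reverse=True)

store = {"http://a.example/": 0.9, "http://b.example/": 0.1}
hits = [("http://b.example/", 0.8), ("http://a.example/", 0.7)]
reranked = rerank(hits, store)
```

Because only the already-matched top hits are rescored, the cost stays bounded no matter how many documents match, which is the same idea behind Elasticsearch's rescore window and Solr's query reranking.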
Re: Handling large scale incremental PageRank updates
When we were doing billion-page crawls a while back, in 2006-2008, we had the following setup.

1. Have a given number of shards to handle the full index; at that time this was 25 million pages per shard for 40 shards, for a total of 1 billion pages.

2. Crawl the pages for 1 shard. Update the WebGraph and LinkRank as described here: https://wiki.apache.org/nutch/NewScoring. Don't use Loops. It was a bad program with a bad algorithm and I never should have put it in. Live and learn.

3. Do the same for shards 2..n, each time updating. Each crawl should get the highest-ranked pages that haven't already been crawled within the recrawl interval.

4. Once you reach the maximum amount you can crawl, reset the crawl intervals for all documents and start over with shard 1, replacing the original shard index with the new one.

With this type of setup you will have possible duplicates, and it is batch, so you don't get the fast updates you might be looking for. It should give you an increasingly better index as crawls continue and more links are added to the WebGraph. Ways to improve this might be:

1. Change the algorithm for when pages get recrawled based on how often they change. This would require determining the change rate.

2. Move fast-changing pages to a separate index and only reindex those after each shard run. This fast index then just becomes another shard.

3. Many realtime or NRT search servers behind a partitioning algorithm. Do a shard crawl, update the WebGraph, and then reindex the top X or Y pages whose links changed most.

This was all done using the Nutch SearchServer back when there was one. Not sure how that setup would translate to a Solr or Elasticsearch setup.

Hope this helps.

Dennis

On 01/15/2016 03:04 PM, Otis Gospodnetić wrote:
> Hello,
>
> We are working on a very large scale crawl (many billions of web pages)
> that needs to make use of link/page rank.
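The rotating shard schedule Dennis describes in steps 1-4 could be sketched roughly like this. The ranks are a toy dict and the function names are hypothetical; in the real setup the selection comes from the WebGraph/LinkRank jobs, not an in-memory table:

```python
# Sketch of the rotating shard schedule: each cycle crawls one shard's
# worth of the highest-ranked pages not yet crawled, updates the graph,
# then moves on; after the last shard, reset and start over.
# Toy ranks and hypothetical function names for illustration only.

def next_batch(ranks, crawled, shard_size):
    """Pick the highest-ranked pages not yet crawled in this cycle."""
    candidates = [(u, r) for u, r in ranks.items() if u not in crawled]
    candidates.sort(key=lambda t: t[1], reverse=True)
    return [u for u, _ in candidates[:shard_size]]

ranks = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.2}
crawled = set()
shards = []
for shard in range(2):              # 2 shards of 2 pages each
    batch = next_batch(ranks, crawled, 2)
    crawled.update(batch)
    shards.append(batch)
    # ... fetch pages, update WebGraph, recompute LinkRank here ...
crawled.clear()                     # step 4: reset and start over
print(shards)                       # [['a', 'c'], ['b', 'd']]
```

Because the rank table keeps improving between shard runs, each batch is picked from fresher scores than the last, which is what makes the index "increasingly better" even though updates are batch.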
Handling large scale incremental PageRank updates
Hello,

We are working on a very large scale crawl (many billions of web pages) that needs to make use of link/page rank. Because page rank for a page P changes as more links to page P are discovered, one really ought to periodically update the rank of the previously indexed page P.

This is not a problem for small crawls, but for large ones this is a problem if one tries to just reindex previously existing pages - reindexing is not cheap and if you've indexed hundreds of millions or billions of pages, reindexing them will take a long time and require a lot of resources.

How do people normally handle that with Solr or Elasticsearch at large scale?

With Solr, do people stick the rank in the External File Field, for example?

With Elasticsearch, do people store pageID => pageRank info in an external store (e.g. Redis) and pull it from there to use when scoring search results? Or maybe that, too, would be too slow when the number of matches is high? Elasticsearch rescore to the rescue?

Or are there better, more scalable ways to handle this?

Thanks,
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/