Dalton, Jeffery wrote:
> I would propose that, even when crawling large web collections, the
> updates may not always be proportional to the total size of the database
> if you want to keep your index fresh. One of the goals of a web search
> engine is to be an accurate representation of what is found on the web,
> that is, to maximize freshness. Several studies have shown that the web
> is very dynamic, with a subset of pages changing constantly -- weekly
> (15%), daily or hourly (~23%), or even more often ("The Evolution of the
> Web and the Implications for an Incremental Crawler" --
> http://citeseer.ist.psu.edu/419993.html and "What's new on the web? The
> evolution of the Web from a Search Engine Perspective" --
> http://www.www2004.org/proceedings/docs/1p1.pdf). To keep dynamic pages
> fresh, they must be crawled and indexed at a high frequency. In short,
> you have to be able to update your database often, say on a daily
> basis, to keep it fresh with important dynamic pages.

I don't follow your argument. If X% of pages change every day, that is
a rate proportional to the size of the entire collection, no?
It comes down to accessing at seek rate or transfer rate. Assume a
drive can seek in 10ms and can transfer at 10MB/s, and assume you've got
a 1B url collection stored on 10 drives. To update just 1% of the urls
at seek rate requires 10M seeks, or 1M seeks/drive, requiring around 3
hours. To update an arbitrary percentage at transfer rate (copying the
database, merging in changes), assuming 100 bytes per url, requires
100GB of transfer, or 10GB/drive, requiring around 17 minutes. The
breakeven point is around 0.1%: 17 minutes buys about 100k seeks per
drive, so only for fewer than about 1M urls updated out of 1B would
seek-based updates be faster than transfer-based.
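
For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic above in Python. The function and parameter names are made up for illustration. It assumes one seek per updated url, updates spread evenly over the drives, all drives working in parallel, and the copy-and-merge costed as a single pass over the data, as in the figures above.

```python
# Back-of-envelope model: seek-based vs. transfer-based database updates.
# All names here are hypothetical; defaults are the figures from this message.

def seek_update_seconds(num_updates, num_drives=10, seek_ms=10):
    """Time to update num_updates urls with one seek each, drives in parallel."""
    seeks_per_drive = num_updates / num_drives
    return seeks_per_drive * seek_ms / 1000


def transfer_update_seconds(total_urls=1_000_000_000, bytes_per_url=100,
                            num_drives=10, transfer_bytes_per_s=10_000_000):
    """Time to stream the whole database once, drives in parallel."""
    bytes_per_drive = total_urls * bytes_per_url / num_drives
    return bytes_per_drive / transfer_bytes_per_s


def breakeven_updates(total_urls=1_000_000_000, bytes_per_url=100,
                      num_drives=10, seek_ms=10,
                      transfer_bytes_per_s=10_000_000):
    """Number of updated urls at which both strategies take the same time."""
    transfer_s = transfer_update_seconds(total_urls, bytes_per_url,
                                         num_drives, transfer_bytes_per_s)
    seeks_per_drive = transfer_s * 1000 / seek_ms
    return seeks_per_drive * num_drives


if __name__ == "__main__":
    total = 1_000_000_000
    one_percent = total // 100
    print("seek-based, 1%% of urls: %.1f hours"
          % (seek_update_seconds(one_percent) / 3600))
    print("transfer-based, any %%:  %.1f minutes"
          % (transfer_update_seconds(total) / 60))
    n = breakeven_updates(total)
    print("breakeven: %.0f urls (%.2f%% of the collection)"
          % (n, 100 * n / total))
```

With the defaults this gives about 10,000 seconds (a bit under 3 hours) for the 1% seek-based update and 1,000 seconds (about 17 minutes) for the full transfer. The breakeven comes out at about 100k seeks per drive, which across 10 drives is roughly 1M urls, or 0.1% of the collection. If the merge has to both read the old database and write the new one at the same transfer rate, double the transfer time and the breakeven doubles with it.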
Doug
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers