Sean Timm wrote:
We too use Heritrix. We tried Nutch first but Nutch was not finding all
of the documents that it was supposed to. When Nutch and Heritrix were
both set to crawl our own site to a depth of three, Nutch missed some
pages that were linked directly from the seed. We ended up with 10%-20%
fewer pages in the Nutch crawl.

FWIW, from a private conversation with Sean it seems that this was likely related to the default configuration in Nutch, which collects only the first 100 outlinks from a page. This is an arbitrary and configurable limit, introduced to reduce the impact of spam pages and to limit the size of LinkDb. If a page hits this limit, any outlinks beyond it are dropped, which produces exactly the symptom you observed: pages linked from the seed that never make it into the crawl.
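
For illustration, here is a rough sketch of how the limit could be raised by overriding it in conf/nutch-site.xml. I am assuming the property name db.max.outlinks.per.page as it appears in nutch-default.xml; please check the file shipped with your version. The value below is only an arbitrary example, not a recommendation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed property name; verify against your nutch-default.xml. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Example value only. In the versions I know of, a negative
         value means "process all outlinks"; confirm before relying on it. -->
    <value>5000</value>
  </property>
</configuration>
```

Pages fetched before the change keep their truncated outlink lists, so they would need to be re-parsed or re-fetched for the new limit to take effect.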



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
