I would like to set up a focussed crawler using Nutch. Assuming I have a way
to detect whether a page is relevant to the topic under consideration, what is
the best architecture? Here are the constraints for the crawler:

(1) The first crawl should yield links/pages that are related to the
domain.
(2) Any non-relevant document in the database should be within N hops of
a relevant document. The assumption is that a non-relevant document close to
a relevant node in the web graph might contain links to other relevant
documents.
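
One way to read constraint (2) is as a per-URL counter: the number of hops
since the nearest relevant page on the path that discovered the URL. The
counter resets to 0 on a relevant page, and a non-relevant page whose counter
exceeds N is neither stored nor expanded. Below is a minimal sketch of that
bookkeeping in plain Python. It is not Nutch code; `focused_crawl`,
`get_links`, `is_relevant` and `max_hops` are hypothetical names, and treating
the seeds as distance 0 is an assumption.

```python
from collections import deque


def focused_crawl(seeds, get_links, is_relevant, max_hops):
    """Return {url: hops since the nearest relevant page} for every kept URL.

    get_links(url)   -> iterable of outlinks (hypothetical fetch/parse step)
    is_relevant(url) -> bool (the relevance detector assumed in the post)
    max_hops         -> the N of constraint (2)
    """
    best = {}  # url -> smallest distance seen so far
    # Assumption: seeds start at distance 0, as if they were relevant.
    queue = deque((url, 0) for url in seeds)

    while queue:
        url, dist = queue.popleft()

        # A relevant page resets the counter for everything reachable from it.
        if is_relevant(url):
            dist = 0

        # Non-relevant page too far from any relevant one: drop it.
        # (In a real crawl it still has to be fetched to decide relevance.)
        if dist > max_hops:
            continue

        # Skip unless this path is strictly shorter than one already recorded.
        if url in best and best[url] <= dist:
            continue
        best[url] = dist

        # Outlinks are one hop further from the nearest relevant page.
        for link in get_links(url):
            queue.append((link, dist + 1))

    return best
```

With a chain A -> B -> C -> D -> E -> F where only A and E are relevant, N = 2
keeps A, B and C and never reaches E, while N = 3 also keeps D and therefore
discovers E and F. This illustrates the trade-off behind the choice of N.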

(1) should be fairly straightforward; however, I am not sure what is involved
in implementing (2). I am also not sure how PageRank will work in the case of a
focussed crawler.

Any comments or suggestions?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
