crawling simulation
-------------------

                 Key: NUTCH-357
                 URL: http://issues.apache.org/jira/browse/NUTCH-357
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.8.1, 0.9.0
            Reporter: Stefan Groschupf
             Fix For: 0.9.0

We recently discovered some serious issues related to crawling and scoring. Reproducing these problems is difficult: first, it is not polite to re-crawl a set of pages again and again; second, it is hard to catch the page that causes a problem. It would therefore be very useful to have a testbed for simulating crawls in which we can control the responses of the "web servers".

To begin with, simulating very basic situations, such as a page that points to itself, link chains, or internal links, would already be very useful. Later on, simulating crawls against existing data collections such as TREC or a webgraph would be much more interesting, for instance to calculate the quality of the Nutch OPIC implementation against PageRank scores of the webgraph, or to evaluate crawling strategies.
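
To make the basic cases concrete, here is a minimal sketch of what such a testbed could look like. It is not Nutch code and does not use any Nutch API; it is written in Python only for brevity. The `SIMULATED_WEB` table, the `http://test/...` URLs, and the `fetch` and `simulate_crawl` functions are all hypothetical names chosen for illustration. The "web server" is just an in-memory map from a URL to the outlinks its page would return, covering a self-link, a link chain, and internal links.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the outlinks its page returns.
SIMULATED_WEB = {
    "http://test/self": ["http://test/self"],            # page points to itself
    "http://test/chain/0": ["http://test/chain/1"],      # link chain
    "http://test/chain/1": ["http://test/chain/2"],
    "http://test/chain/2": [],
    "http://test/home": ["http://test/about", "http://test/self"],  # internal links
    "http://test/about": ["http://test/home"],
}


def fetch(url):
    """Stand-in for a web server: return the page's outlinks, or None if missing."""
    return SIMULATED_WEB.get(url)


def simulate_crawl(seeds, max_depth):
    """Breadth-first crawl over the simulated web; returns URLs in fetch order."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    fetched = []
    while queue:
        url, depth = queue.popleft()
        outlinks = fetch(url)
        if outlinks is None:
            continue  # simulated "page not found"
        fetched.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in outlinks:
            if link not in seen:  # each URL is queued at most once
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


if __name__ == "__main__":
    print(simulate_crawl(["http://test/home"], max_depth=3))
```

Because the responses are fully controlled, a problem page can be reproduced as often as needed without touching a real site. A real testbed would presumably plug in at the protocol level so that the existing fetcher and scoring code run unchanged; that part is left open here.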