crawling simulation
-------------------

                 Key: NUTCH-357
                 URL: http://issues.apache.org/jira/browse/NUTCH-357
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.8.1, 0.9.0
            Reporter: Stefan Groschupf
             Fix For: 0.9.0


We recently discovered  some serious issue related to crawling and scoring. 
Reproducing these problems is a kind of difficult, since first of all it is not 
polite to re-crawl a set of pages again and again, secondly it is difficult to 
catch the page that cause a problem. 
Therefore it would be very useful to have a testbed to simulate crawls where  
we can control the response of  "web servers". 
For the very beginning simulate very basic situation like a page points to it 
self,  link chains or internal links would already be very usefully. 

However later on simulate crawls against existing data collections like TREC or 
a webgraph would be much more interesting, for instance to caculate the quality 
of the nutch OPIC implementation against page rank scores of the webgraph or 
evaluaing crawling strategies.    

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to