[ https://issues.apache.org/jira/browse/NUTCH-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671133#action_12671133 ]

Andrzej Bialecki  commented on NUTCH-357:
-----------------------------------------

Closing this issue - the suggested solution seems to address the problem
sufficiently.

> crawling simulation
> -------------------
>
>                 Key: NUTCH-357
>                 URL: https://issues.apache.org/jira/browse/NUTCH-357
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring.
> Reproducing these problems is difficult: first, it is not polite to
> re-crawl a set of pages again and again; second, it is difficult to catch
> the page that causes a problem.
> Therefore it would be very useful to have a testbed to simulate crawls where
> we can control the responses of the "web servers".
> To begin with, simulating very basic situations, such as a page that points
> to itself, link chains, or internal links, would already be very useful.
> Later on, however, simulating crawls against existing data collections like
> TREC or a webgraph would be much more interesting, for instance to measure
> the quality of the Nutch OPIC implementation against the PageRank scores of
> the webgraph, or to evaluate crawling strategies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
