[ 
https://issues.apache.org/jira/browse/NUTCH-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2020:
----------------------------------------
    Summary: Establish Butch - the Continuous Benchmarking Evaluation for Nutch 
 (was: Estalbish Butch - the Continuous Benchmarking Evaluation for Nutch)

> Establish Butch - the Continuous Benchmarking Evaluation for Nutch
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2020
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2020
>             Project: Nutch
>          Issue Type: New Feature
>          Components: deployment
>    Affects Versions: 2.4, 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>
> I would like to initiate something I've provisionally called BUTCH with the 
> aim of providing a continuous benchmarking evaluation for Nutch. 
> I wrote a utility script called 
> [nipt](https://github.com/lewismc/nipt/blob/master/bootstrap.sh) which 
> essentially pulls the top 1M URLs from Alexa, does some simple reformatting 
> using sed and provides us with a flat file containing the top 1M URLs.
> Loads of these are obviously porn-related (and god knows what else), so I 
> would not advise injecting this garbage into any crawldb that you own or 
> administer.
> I want to augment the [Benchmark 
> tool](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/Benchmark.java)
>  to simulate injecting the script's output and fetching the URLs. Essentially this 
> could run continuously with us sending results to the dev@ list or making 
> them available via some GUI.
> The first step is for me to code this up. The second step is for me to get 
> Apache Infra to provide us with some nice machines (courtesy of Rackspace) 
> which can host this for us. 
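
For readers who want a feel for the reformatting step described above, here is a minimal sketch in Python. It is not the nipt script itself (which uses sed). It assumes the ranked list arrives as `rank,domain` rows and that prefixing each domain with `http://` is an acceptable way to build a seed URL. The function names `ranked_domains_to_urls` and `write_seed_file` are hypothetical and exist only for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, Optional, TextIO


def ranked_domains_to_urls(lines: Iterable[str], scheme: str = "http") -> Iterator[str]:
    """Yield one URL per 'rank,domain' row (assumed input format).

    Blank rows and rows that do not have exactly two fields are skipped.
    """
    for row in csv.reader(lines):
        if len(row) != 2:
            continue
        domain = row[1].strip()
        if domain:
            yield f"{scheme}://{domain}/"


def write_seed_file(lines: Iterable[str], out: TextIO, limit: Optional[int] = None) -> int:
    """Write one URL per line to `out` and return how many were written.

    `limit` caps the number of URLs, which is handy for a small trial run
    before working with the full list.
    """
    count = 0
    for url in ranked_domains_to_urls(lines):
        if limit is not None and count >= limit:
            break
        out.write(url + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder domains only; a real run would read the downloaded list.
    sample = io.StringIO("1,example.com\n2,example.org\n\n3,example.net\n")
    seeds = io.StringIO()
    written = write_seed_file(sample, seeds, limit=2)
    print(written)
    print(seeds.getvalue(), end="")
```

The output is a flat file with one URL per line. The `limit` argument is there because, as noted above, injecting the full unfiltered list into a crawldb you care about is not advisable.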



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
