[ https://issues.apache.org/jira/browse/NUTCH-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-2020:
----------------------------------------
    Summary: Establish Butch - the Continuous Benchmarking Evaluation for Nutch  (was: Estalbish Butch - the Continuous Benchmarking Evaluation for Nutch)

> Establish Butch - the Continuous Benchmarking Evaluation for Nutch
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2020
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2020
>             Project: Nutch
>          Issue Type: New Feature
>          Components: deployment
>    Affects Versions: 2.4, 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>
> I would like to initiate something I've provisionally called BUTCH, with the
> aim of providing a continuous benchmarking evaluation for Nutch.
> I wrote a utility script called
> [nipt](https://github.com/lewismc/nipt/blob/master/bootstrap.sh) which
> essentially pulls the top 1M URLs from Alexa, does some simple reformatting
> using sed, and provides us with a flat file containing those URLs.
> Loads of these are obviously porn (and god knows what else) related, so I
> would not advise injecting this garbage into any crawldb that you own or
> administer.
> I want to augment the [Benchmark
> tool](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/Benchmark.java)
> to imitate injecting the script's output and fetching the URLs. Essentially
> this could run continuously, with us sending results to the dev@ list or
> making them available via some GUI.
> The first step is for me to code this up. The second step is for me to get
> Apache Infra to provide us with some nice machines (courtesy of Rackspace)
> which can host this for us.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)