Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by MikeBrzozowski:
http://wiki.apache.org/nutch/MonitoringNutchCrawls

New page:
= Monitoring Nutch Crawls =

So you've got Nutch all configured and turned it loose on your site, but your 
itchy trigger finger just needs to know how well it's working? Here are a 
couple of ways you can keep an eye on your crawl:

== Monitoring network traffic ==

One way is to watch Nutch suck up your bandwidth as it crawls its way around. 
If you look at a graph of historical bandwidth usage, you should see it spike 
up and stay at a fairly consistent plateau, with a valley each time a segment 
completes (Nutch doesn't use any bandwidth while it is merging segments).

Some tools for this:
 * [http://www.ntop.org/overview.html ntop] (Linux, Windows) - A nifty program 
that gives you a Web-based history of your machine's bandwidth usage. With luck 
it will install easily, because the website isn't much help with installation.
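
If ntop won't install, a rough stand-in is to sample the kernel's byte counters yourself. The sketch below is not part of ntop or Nutch. It assumes a Linux machine where /proc/net/dev is available, and the function names (rx_bytes, watch_rx) and the interface name eth0 are made up for illustration, so substitute your own.

```shell
#!/bin/sh
# rx_bytes IFACE [FILE]
# Prints the cumulative received-byte counter for IFACE.
# FILE defaults to /proc/net/dev (Linux only, an assumption); it is a
# parameter mainly so the parser can be tried against a saved copy.
rx_bytes() {
  awk -v ifc="$1" '
    { sub(/:/, " ") }         # turn "eth0:123" into "eth0 123"
    $1 == ifc { print $2 }    # field 2 is the receive byte count
  ' "${2:-/proc/net/dev}"
}

# watch_rx IFACE [SECONDS]
# Prints the average receive rate in bytes per second once per interval
# (60 seconds unless told otherwise). Runs until interrupted.
watch_rx() {
  interval=${2:-60}
  prev=$(rx_bytes "$1")
  while sleep "$interval"
  do
    cur=$(rx_bytes "$1")
    echo "$1: $(( (cur - prev) / interval )) bytes/s received"
    prev=$cur
  done
}
```

Put the functions in a file, source it, and run something like watch_rx eth0 60 while the crawl is going. You should see the same plateau-and-valley pattern described above, just as numbers instead of a graph.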

== Monitoring fetch statistics ==

Of course, the bandwidth alone doesn't tell the whole story. How many pages are 
you retrieving? How many failed?

Here's a quick little shell script to do this. I'm sure people can improve on 
it, so edit this page if you can!

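#!/bin/sh
# Report crawl progress once a minute by counting log lines in nohup.out.
echo "Monitoring nohup.out crawl progress..."
while :
do
  # grep -c counts matching lines directly, so no pipe to wc is needed.
  tried=$(grep -c 'fetching' nohup.out)
  failed=$(grep -c 'failed' nohup.out)
  echo "Tried $tried pages; $failed failed."
  sleep 60
done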

=== To run this script: ===
 1. Save this script as something like monitorCrawl.sh
 2. Run your preferred crawl script with nohup, like this: nohup <nutch crawl 
command or script> &
 3. By default, nohup writes the crawl's output to nohup.out in the working 
directory. From the same directory, run: sh monitorCrawl.sh

This will give you minute-by-minute stats on how many pages Nutch tried to 
fetch and how many failed with errors (e.g. 404, server unreachable).
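
If you'd rather see a failure rate than two raw counts, the same idea can be wrapped in a function. This is only a sketch under the same assumption as the script above: one log line containing "fetching" per attempted page and one containing "failed" per error. The function name crawl_stats is made up for illustration, and a URL that happens to contain either word will throw the counts off slightly.

```shell
#!/bin/sh
# crawl_stats LOGFILE
# Prints the number of fetch attempts, the number of failures, and the
# failure rate as a whole-number percentage (rounded down).
crawl_stats() {
  log=$1
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 1
  fi
  # grep -c exits non-zero when nothing matches but still prints 0,
  # so "|| true" keeps the function usable under "set -e".
  tried=$(grep -c 'fetching' "$log" || true)
  failed=$(grep -c 'failed' "$log" || true)
  if [ "$tried" -gt 0 ]; then
    pct=$((failed * 100 / tried))
  else
    pct=0    # nothing fetched yet, so avoid dividing by zero
  fi
  echo "Tried $tried pages; $failed failed ($pct%)."
}
```

Call it as crawl_stats nohup.out from the crawl's working directory, either once by hand or inside the same while/sleep loop as the script above.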

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
