Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MikeBrzozowski:
http://wiki.apache.org/nutch/MonitoringNutchCrawls

New page:

= Monitoring Nutch Crawls =

So you got Nutch all configured and turned it loose on your site, and now you're itching to know how well it's working? Here are a couple of ways you can keep an eye on your crawl:

== Monitoring network traffic ==

One way is to watch Nutch use up your bandwidth as it crawls its way around. If you look at a graph of historical bandwidth usage, you should see it spike up and stay at a fairly consistent plateau, with a valley each time a segment completes (while Nutch is merging segments it doesn't use any bandwidth).

Some tools for this:

 * [http://www.ntop.org/overview.html ntop] (Linux, Windows) - A nifty program that gives you a Web-based history of your machine's bandwidth usage. You might get lucky and have it install easily, which would be good, because the website offers little installation help.

== Monitoring fetch statistics ==

Of course, bandwidth alone doesn't tell the whole story. How many pages are you retrieving? How many failed? Here's a quick little shell script that reports both. I'm sure people can improve on it, so edit this page if you can!

  #!/bin/sh
  # Report once a minute how many fetches nohup.out has logged so far.
  echo "Monitoring nohup.out crawl progress..."
  while :
  do
    # grep -c counts matching lines directly (same result as grep | wc -l)
    echo "Tried $(grep -c 'fetching' nohup.out) pages; $(grep -c 'failed' nohup.out) failed."
    sleep 60
  done

=== To run this script: ===

 1. Save this script as something like monitorCrawl.sh
 2. Run your preferred crawl script with nohup, like this:
      nohup <nutch crawl command or script> &
 3. By default, this will output to nohup.out in the working directory. From the same directory, run:
      sh monitorCrawl.sh

This will give you minute-by-minute stats on how many pages Nutch tried to fetch and how many failed with errors (e.g. 404, server unreachable).
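
If you also want a failure rate, something along these lines might work. It is only a sketch: crawl_stats is a made-up helper name, not part of Nutch, and it relies on the same assumption as the script above, namely that every fetch attempt logs a line containing "fetching" and every failure logs a line containing "failed". Check your own nohup.out before trusting the numbers.

```shell
#!/bin/sh
# Sketch only: crawl_stats is a hypothetical helper, not part of Nutch.
# Assumes fetch attempts log "fetching" and failures log "failed".

crawl_stats() {
    log=${1:-nohup.out}   # log file to inspect; defaults to nohup.out

    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches or the file is missing, so ignore the exit status.
    tried=$(grep -c 'fetching' "$log" 2>/dev/null) || true
    failed=$(grep -c 'failed' "$log" 2>/dev/null) || true

    # A missing file produces no count at all; treat that as zero.
    tried=${tried:-0}
    failed=${failed:-0}

    # Avoid dividing by zero before the first fetch has been logged.
    if [ "$tried" -gt 0 ]; then
        pct=$((failed * 100 / tried))
    else
        pct=0
    fi

    echo "Tried $tried pages; $failed failed ($pct%)."
}
```

Calling crawl_stats nohup.out in place of the echo line inside the while loop should give the same once-a-minute report with a percentage added. The percentage is integer division, so it rounds down.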
_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs