Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MikeBrzozowski:
http://wiki.apache.org/nutch/MonitoringNutchCrawls

------------------------------------------------------------------------------
  Of course, the bandwidth alone doesn't tell the whole story. How many pages are you retrieving? How many failed? Here's a quick little shell script to do this; I'm sure people can improve on this--edit this page if so!

-
+ {{{
  #!/bin/sh
  echo "Monitoring nohup.out crawl progress..."
  while :
@@ -24, +24 @@

  echo "Tried `grep 'fetching' nohup.out | wc -l` pages; `grep 'failed' nohup.out | wc -l` failed."
  sleep 60
  done
-
+ }}}

  === To run this script: ===
- 1. Save this script as something like monitorCrawl.sh
+ 1. Save this script as something like `monitorCrawl.sh`
- 2. Run your preferred crawl script with nohup, like this: nohup <nutch crawl command or script> &
+ 2. Run your preferred crawl script with nohup, like this: `nohup <nutch crawl command or script> &`
- 3. By default, this will output to nohup.out in the working directory. From the same directory, run: sh monitorCrawl.sh
+ 3. By default, this will output to nohup.out in the working directory. From the same directory, run: `sh monitorCrawl.sh`

  This will give you minute-by-minute stats on how many pages nutch tried to fetch and how many failed with errors (e.g. 404, server unreachable).

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
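
The diff above shows only part of the monitoring script, so here is a minimal, self-contained sketch of the same idea for reference. It is not the wiki page's exact script. The function names (count_lines, report_once, monitor_crawl) and the optional file and interval arguments are hypothetical additions. Like the original, it assumes every fetch attempt logs a line containing "fetching" and every error logs a line containing "failed".

```shell
#!/bin/sh
# Sketch of a crawl monitor in the spirit of the wiki script.
# Assumption: fetch attempts log a line containing "fetching" and
# errors log a line containing "failed" (as the original grep calls imply).

# count_lines PATTERN FILE
# Print the number of lines in FILE that contain PATTERN.
# Prints 0 if FILE does not exist yet (e.g. the crawl has not started).
count_lines() {
    if [ -f "$2" ]; then
        # grep -c exits non-zero when the count is 0; ignore that status.
        grep -c -- "$1" "$2" || true
    else
        echo 0
    fi
}

# report_once FILE
# Print a single status line for the log FILE.
report_once() {
    tried=$(count_lines 'fetching' "$1")
    failed=$(count_lines 'failed' "$1")
    echo "Tried $tried pages; $failed failed."
}

# monitor_crawl [FILE] [SECONDS]
# Report forever, sleeping SECONDS between reports.
# Defaults match the original: nohup.out, every 60 seconds.
monitor_crawl() {
    log=${1:-nohup.out}
    interval=${2:-60}
    echo "Monitoring $log crawl progress..."
    while :
    do
        report_once "$log"
        sleep "$interval"
    done
}
```

To use it, add a final line such as `monitor_crawl nohup.out 60` and run the file with sh from the crawl's working directory. The sketch uses `grep -c` instead of `grep | wc -l` only to avoid the leading whitespace some `wc` implementations print; the counts are the same.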