Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

The comment on the change is:
Changed regex pattern and small script mistakes

------------------------------------------------------------------------------
  To select links from the index and crawl for other pages there are a couple of nutch commands: generate, fetch and updatedb. The following bash script combines these, so that it can be started with just two parameters: the base directory of the data and the number of pages. Save this file as e.g. bin/fetch; if the data is in crawled01, then `bin/fetch crawled01 10000' selects 10000 links from the index and fetches them.
  {{{
  bin/nutch generate $1/crawldb $1/segments -topN $2
- segement=`bin/hadoop dfs -ls crawled01/segments/ tail -1 | grep -o [[:alnum:/]*`
+ segment=`bin/hadoop dfs -ls crawled01/segments/ | tail -1 | grep -o [a-zA-Z0-9/]*`
  bin/nutch fetch $segment
  bin/nutch updatedb $1/crawldb $segment
  }}}

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
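
A note on the bin/fetch script in the diff above: the added line still lists crawled01/segments/ literally, so the script would only pick up the right segment when the first parameter is crawled01. The sketch below is one possible variant, not the wiki's version. It uses $1 for the listing as well, quotes the grep pattern so the shell cannot expand it as a glob, and keeps only the first match. The function names latest_segment and fetch_round are hypothetical, and the sketch assumes that the segment path is the first run of letters, digits and slashes on the last line of the listing, which may not hold for every Hadoop version.

```shell
#!/usr/bin/env bash
# Sketch of bin/fetch that takes the base directory from $1 throughout.
# latest_segment and fetch_round are illustrative names, not part of Nutch.

# Read a directory listing on stdin and print the path on its last line.
# Assumption: the path is the first run of letters, digits and slashes there.
latest_segment() {
  tail -1 | grep -o '[a-zA-Z0-9/]*' | head -1
}

# Run one generate / fetch / updatedb round.
#   $1: base directory of the data (e.g. crawled01)
#   $2: number of links to select
fetch_round() {
  local base=$1 topn=$2 segment
  bin/nutch generate "$base/crawldb" "$base/segments" -topN "$topn" || return 1
  segment=$(bin/hadoop dfs -ls "$base/segments/" | latest_segment)
  if [ -z "$segment" ]; then
    echo "no segment found under $base/segments" >&2
    return 1
  fi
  bin/nutch fetch "$segment" || return 1
  bin/nutch updatedb "$base/crawldb" "$segment"
}

# Only run when called with both parameters, e.g. bin/fetch crawled01 10000
if [ "$#" -eq 2 ]; then
  fetch_round "$1" "$2"
fi
```

Keeping the path extraction in its own function makes it possible to check the pattern against a sample listing without a running Hadoop installation.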