Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

------------------------------------------------------------------------------
  bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled01
  }}}

+ == Crawling more pages ==
+ Selecting links that are already in the crawl database and fetching the pages they point to takes three nutch commands: generate, fetch and updatedb. The following bash script combines them, so that it can be started with just two parameters: the base directory of the data and the number of pages to fetch. Save it as e.g. bin/fetch; if the data is in crawled01, then `bin/fetch crawled01 10000' selects 10000 links and fetches them.
+ {{{
+ bin/nutch generate $1/crawldb $1/segments -topN $2
+ segment=`bin/hadoop dfs -ls $1/segments/ | tail -1 | grep -o '[[:alnum:]/]*'`
+ bin/nutch fetch $segment
+ bin/nutch updatedb $1/crawldb $segment
+ }}}
+
+ To build a new index, use the following script, which takes the base directory of the data as its only parameter:
+ {{{
+ bin/hadoop dfs -rmr $1/indexes
+ bin/nutch invertlinks $1/linkdb $1/segments/*
+ bin/nutch index $1/indexes $1/crawldb $1/linkdb $1/segments/*
+ }}}
+
+ Copy the data to the local file system, and searches will then run against the new data.
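
The fetch script above depends on picking the newest segment out of a directory listing, so that step may be worth trying in isolation before running a real crawl. The following is only a sketch under stated assumptions: the function name `latest_segment` and the sample listing are hypothetical, and it assumes that each listing line starts with the segment path and that the last line is the newest segment (as `tail -1` in the script also assumes). It uses awk to take the first field instead of the grep pattern from the script.

```shell
#!/usr/bin/env bash
# Sketch: choose the segment named on the last line of a directory listing.
# Assumption: every listing line begins with the path, followed by whitespace.

latest_segment() {
  # Read a listing on stdin and print the first field of its last non-empty line.
  awk 'NF { last = $1 } END { if (last != "") print last }'
}

# Hypothetical listing, standing in for the output of
# `bin/hadoop dfs -ls crawled01/segments/`; the real format may differ.
sample_listing='Found 2 items
crawled01/segments/20070101120000 <dir>
crawled01/segments/20070102093000 <dir>'

segment=$(printf '%s\n' "$sample_listing" | latest_segment)
echo "$segment"
```

In the fetch script, the same function could be fed from the real listing, e.g. by piping `bin/hadoop dfs -ls $1/segments/` into it, once the actual output format has been checked against the assumption above.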