Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

The comment on the change is:
Changed the regex pattern and fixed small script mistakes

------------------------------------------------------------------------------
  Selecting links from the index and crawling them for further pages takes three 
Nutch commands: generate, fetch and updatedb. The following bash script 
combines them so that it can be started with just two parameters: the base 
directory of the data and the number of pages. Save the file as, for example, 
bin/fetch. If the data is in crawled01, then `bin/fetch crawled01 10000' selects 
10000 links from the index and fetches them. 
  {{{
  bin/nutch generate $1/crawldb $1/segments -topN $2
- segement=`bin/hadoop dfs -ls crawled01/segments/ tail -1 | grep -o 
[[:alnum:/]*`
+ segment=`bin/hadoop dfs -ls crawled01/segments/ | tail -1 | grep -o 
[a-zA-Z0-9/]*`
  bin/nutch fetch $segment
  bin/nutch updatedb $1/crawldb $segment
  }}}
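
The script above still hardcodes crawled01 in the segment lookup and leaves
the grep pattern unquoted, so the shell may expand it as a glob. The following
is a minimal sketch of one way to tighten it, not the wiki's own version. The
function names latest_segment and fetch_round are hypothetical, and the sketch
assumes that the last line printed by `bin/hadoop dfs -ls' starts with the
path of the newest segment; check that against your Hadoop version.

```shell
#!/bin/bash
# Sketch only: latest_segment and fetch_round are illustrative names,
# not part of Nutch or Hadoop.

# Read a `hadoop dfs -ls` listing on stdin and print the path on its last line.
# Assumption: the path is the first run of [a-zA-Z0-9/] characters on that line.
latest_segment() {
  tail -n 1 | grep -o '[a-zA-Z0-9/][a-zA-Z0-9/]*' | head -n 1
}

# Run one generate/fetch/updatedb round.
#   $1 = base directory of the crawl data, $2 = number of links to select
fetch_round() {
  base=$1
  topn=$2
  bin/nutch generate "$base/crawldb" "$base/segments" -topN "$topn" || return 1
  # Use the base directory argument instead of a fixed crawled01.
  segment=$(bin/hadoop dfs -ls "$base/segments/" | latest_segment)
  if [ -z "$segment" ]; then
    echo "no segment found under $base/segments" >&2
    return 1
  fi
  bin/nutch fetch "$segment" || return 1
  bin/nutch updatedb "$base/crawldb" "$segment"
}

# Only start a round when both parameters are given.
if [ "$#" -eq 2 ]; then
  fetch_round "$1" "$2"
fi
```

Saved as bin/fetch, it would be started the same way as before, for example
`bin/fetch crawled01 10000'. Quoting the pattern and keeping only the first
match avoids picking up extra tokens that may follow the path in the listing.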

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
