[Nutch-cvs] [Nutch Wiki] Update of "Nutch0.9-Hadoop0.10-Tutorial" by mozdevil

Apache Wiki Fri, 23 Feb 2007 07:07:09 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by mozdevil:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

------------------------------------------------------------------------------
  bin/nutch server 9999 ${SEARCH_INSTALL_DIR}/local/crawled01
  }}}
  
+ == Crawling more pages ==
+ To select more links from the crawl database (crawldb) and fetch the pages they point to, Nutch provides the commands generate, fetch and updatedb. The following bash script combines them, so that it can be started with just two parameters: the base directory of the data and the number of pages. Save it as e.g. bin/fetch; if the data is in crawled01, then `bin/fetch crawled01 10000` selects 10000 links and fetches them.
+ {{{
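+ # Select the top $2 links from the crawldb and write them to a new segment
+ bin/nutch generate $1/crawldb $1/segments -topN $2
+ # Take the last entry of the segment listing and keep only its path
+ segment=`bin/hadoop dfs -ls $1/segments/ | tail -1 | grep -o '[[:alnum:]/]*' | head -1`
+ # Fetch the pages of that segment, then merge the results into the crawldb
+ bin/nutch fetch $segment
+ bin/nutch updatedb $1/crawldb $segment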
+ }}}
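
The segment lookup in the script above depends on the exact layout of the `dfs -ls` output. As a rough sketch of a more defensive alternative, the hypothetical helper `latest_segment` below reads a listing on stdin and prints the newest segment path. It assumes that each listing line carries the segment path as one whitespace-separated field and that segment directories are named by their creation timestamp, so the greatest name is the newest.

```shell
#!/bin/bash
# Sketch only: latest_segment is a hypothetical helper, not part of Nutch or Hadoop.
# Assumption: the listing on stdin has one entry per line, with the segment path
# as one whitespace-separated field, and segments are named by timestamp.
latest_segment() {
  # Print every field that ends in /segments/<digits>, then keep the greatest one.
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /\/segments\/[0-9]+$/) print $i }' \
    | sort | tail -1
}
```

Inside the fetch script it could be used as `segment=$(bin/hadoop dfs -ls "$1/segments" | latest_segment)`. Header lines and trailing columns in the listing are ignored because only fields that look like segment paths are printed.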
+ 
+ To build a new index use the following script:
+ {{{
+ bin/hadoop dfs -rmr $1/indexes
+ bin/nutch invertlinks $1/linkdb $1/segments/*
+ bin/nutch index $1/indexes $1/crawldb $1/linkdb $1/segments/*
+ }}}
+ 
+ Copy the data to the local filesystem, and searches will then run against the new data.
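
The copy step could look roughly like the sketch below. It assumes that `bin/hadoop dfs -copyToLocal` is the copy command in use and that the search server reads from `${SEARCH_INSTALL_DIR}/local/<base-dir>`, as in the `bin/nutch server` call earlier on this page. The function name `copy_to_local` and the `HADOOP` variable are hypothetical and only serve the illustration.

```shell
#!/bin/bash
# Sketch only: copy_to_local and the HADOOP variable are hypothetical names.
# Assumption: SEARCH_INSTALL_DIR points at the search installation and the
# search server reads its data from $SEARCH_INSTALL_DIR/local/<base-dir>.
copy_to_local() {
  local base=$1
  if [ -z "$base" ] || [ -z "$SEARCH_INSTALL_DIR" ]; then
    echo "usage: SEARCH_INSTALL_DIR=<dir> copy_to_local <base-dir>" >&2
    return 1
  fi
  local dest="$SEARCH_INSTALL_DIR/local"
  mkdir -p "$dest" || return 1
  # Copy the crawl data out of the DFS into the local search directory.
  ${HADOOP:-bin/hadoop} dfs -copyToLocal "$base" "$dest/$base"
}
```

Setting `HADOOP=echo` prints the command instead of running it, which is a cheap way to check the paths before copying a large crawl.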
+   
+ 

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
