Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by SteveSeverance:
http://wiki.apache.org/nutch/Getting_Started

------------------------------------------------------------------------------
  2. Then you need to submit your job to Hadoop to be run. This is done by calling JobClient.runJob, which submits the job for execution and handles receiving status updates back from it. It starts by creating an instance of JobClient, then pushes the job toward execution by calling JobClient.submitJob. (A minimal sketch of such a driver appears at the end of this page.)
  3. JobClient.submitJob handles splitting the input files and generating the MapReduce tasks.

+ === How are Nutch's files laid out? ===
+
+ As a user you have some control over how files are laid out. I am going to show how a typical directory structure might work. The root for a crawl is /crawl. This directory can be called whatever you want, and I will assume that it is in your Nutch directory, but it does not have to be. Remember that these directories really hold data for MapFile files. MapFiles are laid out as <key, value> pairs. Inside of /crawl there are several subdirectories:
+
+ /indexes - This directory holds the index that is generated by calling bin/nutch index. The directory must be called /indexes for the NutchBean to work.
+ /segments - When you generate a list of URLs to crawl using bin/nutch generate, a new segment is created. The segment's name is a timestamp. More on this in a minute.
+ /linkdb - This holds a list of pages and their links. It is used by the indexer to get incoming anchor text.
+ /crawldb - This holds a list of all URLs that Nutch knows about and their status, including any errors that occurred when retrieving the page.
+
+ The segment directory:
+
+ Inside of each segment there are the following directories:
+
+ /content
+ /crawl_fetch
+ /crawl_generate
+ /crawl_parse
+ /parse_data
+ /parse_text
+
+ These directories contain all the data for all the pages in each segment. Let's go through the directories.
+
+ /content
+ This directory contains the raw content of each downloaded URL. Format <Url, Content>
+
+ /crawl_fetch
+ This directory contains the status of fetching each URL. Format <Url, CrawlDatum>
+
+ /crawl_generate
+ This directory contains the list of URLs to download as part of the segment. Format <Url, CrawlDatum>
+
+ /crawl_parse
+ This directory contains the outlink URLs that are used to update the crawldb. Format <Url, CrawlDatum>
+
+ /parse_data
+ This directory contains metadata about each page, including its outlinks. Format <Url, ParseData>
+
+ /parse_text
+ This directory contains the parsed text of each page. Format <Url, ParseText>
+
- === How do I open Nutch's data files ===
+ === How do I open Nutch's data files? ===
  You will need to interact with Nutch's files using Hadoop's MapFile and SequenceFile classes. This simple code sample shows opening a file and reading the values.
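The sketch below is one way to do it. It assumes a crawldb sitting at crawl/crawldb on the local filesystem; the path and the class name are placeholders, not part of Nutch.

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class ReadCrawlDb {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A crawldb is a directory of MapFiles, one per reduce task;
    // part-00000 is the first (and often the only) one. The path
    // is an assumption -- adjust it to wherever your crawl lives.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "crawl/crawldb/current/part-00000", conf);

    Text url = new Text();               // key:   the URL
    CrawlDatum datum = new CrawlDatum(); // value: its status

    // next() fills in the key and value, returning false at end of file.
    while (reader.next(url, datum)) {
      System.out.println(url + "\t" + datum);
    }
    reader.close();
  }
}
}}}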
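Segment data can be read the same way. Directories that keep an index file next to the data file are MapFiles; those without one (crawl_generate, for example) are plain SequenceFiles, which you can open with new SequenceFile.Reader(fs, new Path(...), conf) and iterate with the same next(key, value) loop.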
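Finally, going back to the job-submission steps at the top of this page, here is a minimal sketch of a driver that hands a job to Hadoop. This is not Nutch code: it uses the IdentityMapper and IdentityReducer that ship with Hadoop, and the job name and input/output paths are placeholders.

{{{
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    // 1. Describe the job: the code to run and where the data lives.
    JobConf job = new JobConf(SubmitExample.class);
    job.setJobName("submit-example");            // placeholder name
    job.setMapperClass(IdentityMapper.class);    // pass records through unchanged
    job.setReducerClass(IdentityReducer.class);
    job.setOutputKeyClass(LongWritable.class);   // TextInputFormat's key type
    job.setOutputValueClass(Text.class);         // TextInputFormat's value type
    FileInputFormat.setInputPaths(job, new Path("input"));    // placeholder
    FileOutputFormat.setOutputPath(job, new Path("output"));  // placeholder

    // 2. runJob creates a JobClient, calls JobClient.submitJob to launch
    //    the job, then polls for status updates until the job finishes.
    // 3. submitJob is where the input files are split and the map and
    //    reduce tasks are generated.
    JobClient.runJob(job);
  }
}
}}}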